...

#

Title

User Story

Importance

Notes

1

Accommodate arbitrary data schemas (not just UserALE): allows us to expand to cover arbitrarily many data sources without needing to write new methods for Distill

Importance: Should Have

2

Distill is primarily designed to support UserALE.js

Distill may be modified to support other time-dependent, sequential event data with meta-data

Importance: Must Have

2

Distill will support JSON datastores for datasource/query

0.2.0 will only support Elastic 5.6.3+

Importance: Must Have

3

Installable as a Python library (via pip install)

Importance: Must Have

4

Consumable as a RESTful web service

Importance: Must Have

5

Provide a set of built-in data queries for common use cases

Importance: Must Have

6

Allow users to make custom queries not covered by the built-in queries

Importance: Must Have

7

Provide a set of built-in data transformations for common use cases

Importance: Must Have

8

Thoroughly document the API using Sphinx so that users can extend Distill functionality to custom data transformations, analytics, schemas, etc.

Importance: Must Have

9

Support Windows, Linux, and Mac users with OS-specific eggs/wheels

Importance: Must Have

10

End-to-end encryption

Importance: Must Have

11

Provide convenience libraries with pre-built data schemas for UserALE and other data streams

Importance: Should Have

...

Analytics & Processing Examples

These examples are meant to draw out higher-level goals for Distill's functionality. This section can be removed once the goals have been settled.

Here is a model data pipeline for SENSSOFT: RAW DATA > QUERY > FILTER/Q&A > TRANSFORMATION > PRIMITIVE FEATURE EXTRACTION > TRAINED MODELING > DERIVED FEATURE EXTRACTION

There are a few different classes of libraries that Distill might include in support of this pipeline; they have different consequences for workflows within larger analytic pipelines.

...

QUERY: We may want to be able to recreate previous queries used for other analyses, not necessarily "save" queries.

FILTERING/Q&A: Eliminate data from the query results when it can't be excluded by the query alone, because the pattern to be filtered is fully nested within some query index.

EX: Filter out specific save events from osquery object access data that do not coincide with click/keyboard activity recorded by KM Logger.
EX: Random resampling of the KM Logger event time series: draw one random sample per 1-minute interval.
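
The two filtering examples above can be sketched in plain Python. This is only an illustration, not a Distill API: the function names, the "ts" field (epoch seconds), the "type" field, and the 2-second coincidence window are all assumptions made for the sketch.

```python
import random
from bisect import bisect_left
from collections import defaultdict


def drop_unattended_saves(osquery_events, activity_events, window=2.0):
    """Drop "save" events with no click/keyboard activity within `window` seconds.

    Assumed (hypothetical) record shape: {"ts": <epoch seconds>, "type": <str>}.
    Events of any other type pass through untouched.
    """
    activity_ts = sorted(ev["ts"] for ev in activity_events)
    kept = []
    for ev in osquery_events:
        if ev["type"] == "save":
            # First activity timestamp at or after the start of the window.
            i = bisect_left(activity_ts, ev["ts"] - window)
            near = i < len(activity_ts) and activity_ts[i] <= ev["ts"] + window
            if not near:
                continue
        kept.append(ev)
    return kept


def resample_per_interval(events, interval=60.0, seed=None):
    """Pick one random event from each non-empty `interval`-second bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ev in events:
        buckets[int(ev["ts"] // interval)].append(ev)
    return [rng.choice(buckets[k]) for k in sorted(buckets)]
```

Neither step can be pushed into the query itself, since each depends on comparing a record against its neighbors or against a second data source.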

TRANSFORMATION: The native format of Lucene-like datastores is a list of records, returned as JSON through queries.

EX: Query data from one or more data sources (a bunch of JSON), then impose structure on the JSON so that it is represented as a list of logs ordered by timestamp (TS).
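
A minimal sketch of this transformation, assuming each query returns a JSON array of record objects and that all records share a comparable timestamp field. The function name and the "timestamp" field name are hypothetical, chosen only for illustration.

```python
import json
from itertools import chain


def to_ordered_logs(*json_payloads, ts_field="timestamp"):
    """Merge JSON arrays of records into one list ordered by timestamp.

    Assumes every record carries `ts_field` and that its values are mutually
    comparable (all numeric, or all ISO-8601 strings in the same format).
    """
    records = chain.from_iterable(json.loads(p) for p in json_payloads)
    # sorted() is stable, so records with equal timestamps keep source order.
    return sorted(records, key=lambda rec: rec[ts_field])
```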

PRIMITIVE FEATURE EXTRACTION

EX: Query for UserALE.js data and type=="click", then by userId, then aggregate across some time interval (e.g., count) by unique userId (within-user), and return the result.
EX: Query for UserALE.js data and type=="click", then by userId, then aggregate over logs by unique userId (e.g., count, mean, median, mode, variance, range) (between-user).
EX: Using the count data (EX a), bin by "path" and compute the probability of clicking on path X.
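
The three examples above might look like the following in plain Python. This is a sketch under stated assumptions, not Distill's implementation: the function names are hypothetical, "ts" is assumed to be a timestamp in seconds, and "path" may be either a string or a list of element names.

```python
from collections import Counter
from statistics import mean, median, pvariance


def clicks_per_user_interval(logs, interval=60.0):
    """Within-user: count clicks per (userId, interval index)."""
    counts = Counter()
    for log in logs:
        if log["type"] == "click":
            counts[(log["userId"], int(log["ts"] // interval))] += 1
    return counts


def between_user_summary(logs):
    """Between-user: summarize the per-user click totals."""
    per_user = Counter(log["userId"] for log in logs if log["type"] == "click")
    values = list(per_user.values())
    if not values:
        return {}
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "variance": pvariance(values),  # population variance
        "range": max(values) - min(values),
    }


def path_click_probability(logs):
    """Bin clicks by path and normalize the counts into probabilities."""
    per_path = Counter()
    for log in logs:
        if log["type"] == "click":
            path = log["path"]
            # Lists are unhashable, so use a tuple as the bin key.
            per_path[tuple(path) if isinstance(path, list) else path] += 1
    total = sum(per_path.values())
    return {path: n / total for path, n in per_path.items()}
```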

TRAINED MODELING

EX: Call or recreate PRIMITIVE FEATURES, then feed features to Graph Methods, NN or HMM, etc. (see this paper), return model params to Python Env.
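
The example above leaves the model open (graph methods, NN, HMM). As a deliberately simple stand-in, and not an HMM or any Distill API, the sketch below estimates first-order transition probabilities between event labels and returns them as plain Python dicts, which is one way "model params" could come back to the Python environment. The function name and input shape are assumptions.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next label | current label) from per-user label sequences.

    `sequences` is an iterable of lists, e.g. one ordered list of visited
    paths per user (a hypothetical input shape for this sketch).
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(c.values()) for dst, n in c.items()}
        for src, c in counts.items()
    }
```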

DERIVED FEATURE EXTRACTION

EX: Call or recreate 5 (TRAINED MODELING), extract model features as in 4 (PRIMITIVE FEATURE EXTRACTION), return to Python Env.

Build intervals from matching sequences of raw events
Filter out unwanted events
- Noisy/irrelevant events
  - May be conditional on neighboring events
- "dangling" events (e.g. a stop event with no corresponding start)
Collapse duplicate events into a single event (when is this preferable to creating an interval?)
Create "sandwiches" (a set of events bookended by, e.g., a related start and stop event)
Replace some logs/data with other logs/data
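
Several of the operations listed above (building intervals, isolating dangling events, collapsing duplicates, and extracting "sandwiches") can be sketched as follows. The function names, the "ts", "type", and "key" fields, and the 1-second duplicate gap are all assumptions made for illustration.

```python
def build_intervals(events, start="start", stop="stop"):
    """Pair each start event with the next stop event sharing the same key.

    Returns (intervals, dangling): dangling holds stops with no open start
    and starts that were never closed.
    """
    open_starts = {}
    intervals, dangling = [], []
    for ev in sorted(events, key=lambda e: e["ts"]):
        if ev["type"] == start:
            if ev["key"] in open_starts:
                # An earlier start was never closed before this one began.
                dangling.append(open_starts[ev["key"]])
            open_starts[ev["key"]] = ev
        elif ev["type"] == stop:
            begun = open_starts.pop(ev["key"], None)
            if begun is None:
                dangling.append(ev)
            else:
                intervals.append(
                    {"key": ev["key"], "start": begun["ts"], "end": ev["ts"]}
                )
    dangling.extend(open_starts.values())
    return intervals, dangling


def collapse_duplicates(events, gap=1.0):
    """Collapse runs of same-type events at most `gap` seconds apart."""
    collapsed = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        last = collapsed[-1] if collapsed else None
        if last and last["type"] == ev["type"] and ev["ts"] - last["last_ts"] <= gap:
            last["count"] += 1
            last["last_ts"] = ev["ts"]
        else:
            collapsed.append({**ev, "count": 1, "last_ts": ev["ts"]})
    return collapsed


def sandwich(events, interval):
    """Return the events bookended by an interval, bookends included."""
    return [ev for ev in events if interval["start"] <= ev["ts"] <= interval["end"]]
```

Collapsing keeps a single record plus a count, while an interval keeps the start and end times; which one is preferable likely depends on whether the duration of the run matters downstream.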

...
