Target release: Version 0.2.0
Document status:
Document owner:
Designer:
Developers:
QA: Michelle Beard


Goals

Background and strategic fit

 

Assumptions

 

Requirements

1. Accommodate arbitrary data schemas (not just UserALE): allows us to expand to cover arbitrarily many data sources without needing to write new methods for Distill.
  • Distill is primarily designed to support UserALE.js.
  • Distill may be modified to support other time-dependent, sequential event data with metadata.
2. Distill will support JSON datastores for datasource/query.
  • 0.2.0 will only support Elasticsearch 5.6.3+.
3. Installable as a Python library (via pip install).
4. Consumable as a RESTful web service.
5. Provide a set of built-in data queries for common use cases.
6. Allow users to make custom queries not covered by the built-in queries.
7. Provide a set of built-in data transformations for common use cases.
8. Thoroughly document the API using Sphinx so that users can extend Distill functionality with custom data transformations, analytics, schemas, etc.
9. Support Windows, Linux, and Mac users with OS-specific eggs/wheels.
10. End-to-end encryption.
11. Provide convenience libraries with pre-built data schemas for UserALE and other data streams.
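
As a rough illustration of requirements #3 and #4, here is a minimal sketch assuming Distill is published to PyPI under the name "distill" and exposes a status endpoint on a local port; the package name, port, and endpoint path are placeholders, not a finalized API.

    # Requirement #3: install as a Python library (package name assumed):
    #   pip install distill

    # Requirement #4: consume Distill as a RESTful web service.
    import requests

    # Hypothetical endpoint and port for a locally running Distill instance.
    response = requests.get("http://localhost:8090/status")
    response.raise_for_status()
    print(response.json())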


Questions

Below is a list of questions to be addressed as a result of this requirements document:

1. How do we accommodate different data schemas that allow for multiple data streams?
   Outcome: See requirement #1.

2. Does Distill require a specific backend (Elasticsearch), or can it use Solr/Lucene?
   Outcome: The underlying data store needs to support key-value pairs.

3. How do we support Windows users?
   Outcome: See requirement #9.

4. How do we provide the "average" data scientist enough out-of-the-box packages and modules to be minimally viable?

5. What is the roadmap for supporting packages and an Anaconda distribution?

6. Should Distill migrate from Flask to Django?

7. Is Distill simple Python, or does it run as a service (or on a web service) by design?
   Outcome: Both; see requirements #3 and #4.

8. Does Distill manage scale in its connections to other datastores, or does it rely solely on Lucene-based services (Elasticsearch)?
   Outcome: Distill's querying depends on how well Elasticsearch scales on query.

9. Does Distill remain tethered outright to Elasticsearch?
   Outcome: See requirement #2.

10. TLS or SSL: modern vs. legacy network support?

Analytics & Processing Examples

These examples are intended to draw out higher-level goals for Distill's functionality. This section can be removed once those goals have been solidified.

 

Here is a model data pipeline for Apache SensSoft: RAW DATA > QUERY > FILTER/Q&A > TRANSFORMATION > PRIMITIVE FEATURE EXTRACTION > TRAINED MODELING > DERIVED FEATURE EXTRACTION. A sketch of this end-to-end shape follows.
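
Below is a minimal sketch of the pipeline's shape, assuming each stage can be modeled as a callable that consumes and returns Python data (raw logs, then features, then a model); the stage functions here are identity placeholders that the per-stage sketches later in this section would replace.

    from functools import reduce

    # Placeholder stages, in pipeline order; each is an identity function
    # so the sketch stays runnable. The numbered examples below sketch
    # candidate implementations for each stage.
    STAGES = [
        lambda data: data,  # QUERY
        lambda data: data,  # FILTER/Q&A
        lambda data: data,  # TRANSFORMATION
        lambda data: data,  # PRIMITIVE FEATURE EXTRACTION
        lambda data: data,  # TRAINED MODELING
        lambda data: data,  # DERIVED FEATURE EXTRACTION
    ]

    def run_pipeline(raw_data):
        """Thread raw data through each pipeline stage in order."""
        return reduce(lambda data, stage: stage(data), STAGES, raw_data)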

There are a few different classes of libraries that Distill might include in support of this pipeline; they have different consequences for workflows within larger analytic pipelines.

  1. QUERY: We may want to be able to recreate previous queries used for other analyses, rather than necessarily "saving" queries (see the sketch below).
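
A hedged sketch of recreating a previous query, assuming the Elasticsearch 5.x backend from requirement #2 and the elasticsearch-py client; the index name, field names, and query body are illustrative, not Distill's API.

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # Re-issue a previously used query body verbatim, rather than relying
    # on a server-side "saved" query.
    previous_query = {
        "query": {
            "bool": {
                "must": [
                    {"term": {"type": "click"}},
                    {"range": {"clientTime": {"gte": "now-7d"}}},
                ]
            }
        }
    }
    results = es.search(index="userale", body=previous_query)
    logs = [hit["_source"] for hit in results["hits"]["hits"]]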

  2. FILTERING/Q&A: Elimination of data from a query's return when it cannot be eliminated by the query alone, because the pattern to be filtered is fully nested within some query index.
    1. EX: Filter out specific save events from osquery object-access data that do not coincide with click/keyboard activity from KM Logger.
    2. EX: Random resampling of a KM Logger event time series, e.g., one random sample per one-minute interval (see the sketch below).
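
A minimal sketch of the resampling example (2.2), assuming each log is a dict with a millisecond "clientTime" field; the function and field names are illustrative.

    import random
    from collections import defaultdict

    def resample_per_minute(logs, seed=None):
        """Keep one randomly chosen log per one-minute interval."""
        rng = random.Random(seed)
        buckets = defaultdict(list)
        for log in logs:
            buckets[log["clientTime"] // 60000].append(log)  # bucket by minute
        return [rng.choice(bucket) for _, bucket in sorted(buckets.items())]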

  3. TRANSFORMATION: The native format of Lucene-like datastores is a list of records, returned as JSON through queries.
    1. EX: Query data from one or more data sources (a batch of JSON), then impose structure on the JSON so that it is represented as a list of logs ordered by timestamp (TS).
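
A sketch of example 3.1, assuming query results arrive as lists of dicts sharing a timestamp field; names are illustrative.

    from itertools import chain

    def to_ordered_log_list(*query_results, ts_field="clientTime"):
        """Merge one or more query result lists into a single list of logs
        ordered by timestamp."""
        return sorted(chain(*query_results), key=lambda log: log[ts_field])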

  4. PRIMITIVE FEATURE EXTRACTION
    1. EX: Query for UserALE.js data with type=="click", then group by userId, then aggregate across some time interval (e.g., count) per unique userId (within-user), and return the result.
    2. EX: Query for UserALE.js data with type=="click", then group by userId, then aggregate over logs per unique userId (e.g., count, mean, median, mode, variance, range) (between-user).
    3. EX: Using the count data from EX 1, bin by "path" and compute the probability of clicking on a given path.
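
A hedged sketch of examples 4.1-4.3 using pandas; the field names ("userId", "type", "path", "clientTime") follow UserALE.js logs, but the specific aggregations are illustrative.

    import pandas as pd

    def click_features(logs):
        df = pd.DataFrame(logs)
        clicks = df[df["type"] == "click"].copy()
        clicks["minute"] = pd.to_datetime(clicks["clientTime"], unit="ms").dt.floor("min")

        # 4.1: within-user click counts per one-minute interval
        per_interval = clicks.groupby(["userId", "minute"]).size().rename("clicks")

        # 4.2: between-user aggregates over the interval counts
        per_user = per_interval.groupby(level="userId").agg(["count", "mean", "median", "var"])

        # 4.3: probability of clicking on each path, from the count data
        path_prob = clicks.groupby("path").size() / len(clicks)

        return per_interval, per_user, path_prob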

  5. TRAINED MODELING
    1. EX: Call or recreate the PRIMITIVE FEATURES, then feed them to graph methods, NNs, HMMs, etc. (see this paper), and return the model parameters to the Python environment.
    2. EX: Build a simple directed graph (like Bowie: http://senssoft.incubator.apache.org/) that shows stochastic relationships between elements or pages; return the model parameters to the Python environment, along with in/out degree and centrality metrics.
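
A minimal sketch of example 5.2, assuming networkx and time-ordered click logs from the transformation step; edges count consecutive path-to-path transitions.

    import networkx as nx

    def build_transition_graph(ordered_clicks):
        """Directed graph of element/page transitions, weighted by frequency."""
        g = nx.DiGraph()
        for prev, curr in zip(ordered_clicks, ordered_clicks[1:]):
            u, v = prev["path"], curr["path"]
            if g.has_edge(u, v):
                g.edges[u, v]["weight"] += 1
            else:
                g.add_edge(u, v, weight=1)
        return g

    def graph_params(g):
        """Return model parameters plus in/out degree and centrality metrics."""
        return {
            "in_degree": dict(g.in_degree()),
            "out_degree": dict(g.out_degree()),
            "centrality": nx.degree_centrality(g),
        }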

  6. DERIVED FEATURE EXTRACTION
    1. EX: Call or recreate the model from 5, extract model features as in 4, and return them to the Python environment.
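
A sketch of step 6 under the same assumptions as step 5: recreate the trained graph model, then read derived, per-node features back into the Python environment as an analysis-ready table.

    import networkx as nx
    import pandas as pd

    def derived_feature_table(graph):
        """Tabulate model-derived features (as in 4) from a trained graph (as in 5)."""
        return pd.DataFrame({
            "in_degree": dict(graph.in_degree()),
            "out_degree": dict(graph.out_degree()),
            "centrality": nx.degree_centrality(graph),
        })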

 

Not Doing