
Target release 
Version

0.2.0

Document status

DRAFT

Document owner
Designer
Developers
QA
Michelle Beard

Goals

  • Allow for scalability in the analytics framework for SENSSOFT
    • Distill 0.2.0 will allow us to grow Distill's existing analytical and modeling capabilities, including:
      • Pre-packaged preprocessing methods for filtering, sequencing, and packaging time-series event data, with metadata, as portable Python dictionaries
      • Pre-packaged graph and time-series modeling methods
      • Limited packaging of statistical and data processing Python packages (e.g., NumPy, SciPy, Pandas)
  • Allow for customizable, user-generated Python content within Distill
    • Distill 0.2.0 will allow users to build their own libraries for Distill that combine pre-packaged functions with user-written ones.
    • Distill 0.2.0 will enforce a predictable library structure, making library development less error-prone and reducing the support burden for Distill
  • Allow processed user log data to be ported to different environments (e.g., visualization tools or other analytic environments such as Anaconda)
    • Distill 0.2.0 will generate a predictable output structure, making it easier to develop readers for different environments
    • Distill 0.2.0 will generate predictable outputs that can be consumed by other services and applications
    • Distill 0.2.0 output will be parsable within other Python scripts, batch jobs, or IDEs
  • Provide reduced-to-practice methods that make UserALE.js data useful for a variety of use cases:
    • Automation: other processes and applications are able to consume processed Distill data to drive logic or analytics-based processes
      • Example: A web application changes the rate at which messages are pushed to users based on which application features users are using and how they use them.
      • Example: Distill is regularly called to generate updated model-derived features for secondary processing, such as machine learning.
    • Offline analytics: Data scientists, analysts, and statisticians of various levels of technical depth are able to quickly process user activity data and draw Distill output into their analytics pipeline.
    • Visual analytics: Visualization tools and interactive data visualizations (e.g., TAP) can use Distill to configure interactive features (menus, toggles, drop-downs) that let users interact with user activity data and extract the model parameters used to render results in the visualization.
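
As a rough illustration of the "portable Python dictionaries" goal, the sketch below packages a handful of time-ordered events together with metadata and confirms that the result survives a JSON round trip. The function name `package_events`, the `meta`/`events` layout, and the sample field names are assumptions made for this example, not a defined Distill output format.

```python
import json


def package_events(events, source, schema="userale"):
    """Bundle event dicts with descriptive metadata (hypothetical layout)."""
    # Sequence the events by timestamp so consumers can rely on ordering.
    ordered = sorted(events, key=lambda e: e["clientTime"])
    return {
        "meta": {
            "source": source,
            "schema": schema,
            "count": len(ordered),
            "start": ordered[0]["clientTime"] if ordered else None,
            "end": ordered[-1]["clientTime"] if ordered else None,
        },
        "events": ordered,
    }


# Illustrative sample records; field names are assumed for the example.
raw = [
    {"clientTime": 1500000002000, "type": "click", "target": "button#save"},
    {"clientTime": 1500000001000, "type": "mouseover", "target": "button#save"},
]

package = package_events(raw, source="example-app")

# Only plain dicts, lists, strings, and numbers are used, so the package
# can be handed to another service or script as JSON without custom encoders.
serialized = json.dumps(package)
```

Keeping the output to built-in types is one way to make a reader in another environment trivial to write; richer types (such as datetimes) would need an agreed encoding.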

Background and strategic fit

  • UserALE data is structurally rich, and while manageable, requires technical depth to process beyond simple aggregation
  • Distill 0.2.0 will make UserALE data usable by a wider range of professionals and academics with canonical offerings that can reduce pre-processing workload substantially for novices.
  • Distill 0.2.0 will make UserALE data useful for a wider range of use cases that otherwise depend on deep expertise in working with nested, multi-dimensional, categorical, and semantic time-series data.
  • Distill 0.2.0 will provide the APACHE SENSSOFT community with more discrete, bounded programming projects to add to the code base, with more immediate value to contributors.
  • Distill 0.2.0 will provide APACHE SENSSOFT with substantial horizontal growth opportunities in order to grow the community base.

 

Assumptions

  • Distill will act as an abstraction layer around Elasticsearch, and will not provide direct access to the Elasticsearch API
  • When consumed as a Python library, Distill will return Python data objects (e.g., dictionaries)
  • When consumed as a web service, Distill will return JSON objects
  • Distill will never modify the source data (read operations only)
  • Distill will not provide any storage or caching for transformed data
  • Distill will only support Python 3.6
  • Distill will only support x64
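
The return-type and read-only assumptions above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`filter_by_type`, `as_library_result`, `as_service_result`) and the result layout are hypothetical, and an in-memory list stands in for the data store.

```python
import copy
import json


def filter_by_type(records, event_type):
    """Return matching records as new dicts; the input is never mutated."""
    return [copy.deepcopy(r) for r in records if r.get("type") == event_type]


def as_library_result(records, event_type):
    """Library-style call: the caller receives plain Python objects."""
    matches = filter_by_type(records, event_type)
    return {"type": event_type, "count": len(matches), "records": matches}


def as_service_result(records, event_type):
    """Service-style call: the same result, serialized as a JSON string."""
    return json.dumps(as_library_result(records, event_type), sort_keys=True)


# Hypothetical source records standing in for stored log data.
source = [
    {"type": "click", "target": "a", "details": {"x": 1}},
    {"type": "scroll", "target": "b", "details": {"x": 2}},
]

result = as_library_result(source, "click")
# Changing the returned copy leaves the source records untouched.
result["records"][0]["details"]["x"] = 99
```

Deep-copying is a conservative choice for the sketch; an implementation that only reads from the data store would get the same guarantee without the copy.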

 

Requirements

# | Title | User Story | Importance | Notes
1 | Accommodate arbitrary data schemas (not just UserALE) | | MUST HAVE |
2 | Supported data sources: Elasticsearch | | MUST HAVE |
3 | Installable as a Python library (via pip install) | | MUST HAVE |
4 | Consumable as a RESTful web service | | MUST HAVE |
5 | Provide a set of built-in data queries for common use cases | | MUST HAVE |
6 | Allow users to make custom queries not covered by the built-in queries | | MUST HAVE |
7 | Provide a set of built-in data transformations for common use cases | | MUST HAVE |
8 | Thoroughly document the API using Sphinx so that users can extend Distill functionality to custom data transformations, analytics, schemas, etc. | | MUST HAVE |
9 | Support Windows, Linux, and Mac users with OS-specific eggs/wheels | | MUST HAVE |
10 | End-to-end encryption | | MUST HAVE |
11 | Provide convenience libraries with pre-built data schemas for UserALE and other data streams | | SHOULD HAVE |
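
One possible way to offer built-in transformations while letting users add their own (requirements 7 and 8) is a small name-based registry. The sketch below is only an illustration of that idea; `TRANSFORMS`, `register_transform`, `apply_transform`, and the two sample transformations are hypothetical names, not part of any existing Distill API.

```python
# Hypothetical registry mapping a transformation name to a callable.
TRANSFORMS = {}


def register_transform(name):
    """Decorator that registers a transformation under a unique name."""
    def decorator(func):
        if name in TRANSFORMS:
            raise ValueError(f"transform already registered: {name}")
        TRANSFORMS[name] = func
        return func
    return decorator


@register_transform("drop_type")
def drop_type(events, event_type):
    """Example built-in: remove every event of the given type."""
    return [e for e in events if e.get("type") != event_type]


@register_transform("count_by_type")
def count_by_type(events):
    """Example built-in: count events per type."""
    counts = {}
    for event in events:
        counts[event.get("type")] = counts.get(event.get("type"), 0) + 1
    return counts


def apply_transform(name, events, **params):
    """Look up a transformation by name and apply it to the events."""
    if name not in TRANSFORMS:
        raise KeyError(f"unknown transform: {name}")
    return TRANSFORMS[name](events, **params)


sample = [{"type": "click"}, {"type": "scroll"}, {"type": "click"}]
counts = apply_transform("count_by_type", sample)
clicks_removed = apply_transform("drop_type", sample, event_type="click")
```

A user-written transformation would be added with the same decorator, which keeps built-in and custom code on one predictable path.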

Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question | Outcome
1. How do we accommodate different data schemas that allow for multiple data streams? |
2. Does Distill require a specific backend (Elasticsearch), or can it use Solr/Lucene? | The underlying data store needs to support key-value pairs
3. How do we support Windows users? | Investigate whether we are using packages that don't build on Windows; integrate testing across platforms
4. How do we provide the "average" data scientist with enough packages and modules to be minimally viable out of the box? |
5. Roadmap for supporting packages and Anaconda distribution |
6. Migrate to Django from Flask? |
7. Is Distill a plain Python library, or does it run as a service (or on a web service) by design? |
8. Does Distill manage scale in its connections to other data stores, or does it rely solely on Lucene-based services (Elasticsearch)? |
9. Does Distill remain tethered outright to Elasticsearch? |

Analytics & Processing Examples

These examples are here for drawing out higher-level goals for Distill's functionality. This section can be removed once the goals have been solidified.

  • Build intervals from matching sequences of raw events
  • Filter out unwanted events
    • Noisy/irrelevant events
      • May be conditional on neighboring events
    • "dangling" events (e.g. a stop event with no corresponding start)
  • Collapse duplicate events into a single event (when is this preferable to creating an interval?)
  • Create "sandwiches" (a set of events bookended by, e.g., a related start and stop event)
  • Replace some logs/data with other logs/data
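
To make a few of these examples concrete, here is a minimal sketch of building intervals from matching start/stop events, separating out "dangling" events, and collapsing near-duplicate events. The function names, the `time`/`type`/`target` field names, and the pairing rule (each start pairs with the next stop on the same target) are assumptions chosen for illustration, not settled Distill behavior.

```python
def build_intervals(events, start_type, stop_type, key="target"):
    """Pair each start event with the next stop event sharing the same key.

    Returns (intervals, dangling), where dangling holds starts with no
    stop and stops with no start.
    """
    open_starts = {}
    intervals = []
    dangling = []
    for event in sorted(events, key=lambda e: e["time"]):
        k = event.get(key)
        if event["type"] == start_type:
            if k in open_starts:
                # The earlier start was never closed before this one.
                dangling.append(open_starts[k])
            open_starts[k] = event
        elif event["type"] == stop_type:
            start = open_starts.pop(k, None)
            if start is None:
                dangling.append(event)  # stop with no corresponding start
            else:
                intervals.append({
                    key: k,
                    "start": start["time"],
                    "stop": event["time"],
                    "duration": event["time"] - start["time"],
                })
    # Any start still open at the end has no matching stop.
    dangling.extend(open_starts.values())
    return intervals, dangling


def collapse_duplicates(events, window):
    """Merge consecutive events with the same type and target that occur
    within `window` time units of the previous one in the run."""
    collapsed = []
    for event in sorted(events, key=lambda e: e["time"]):
        last = collapsed[-1] if collapsed else None
        if (
            last is not None
            and last["type"] == event["type"]
            and last.get("target") == event.get("target")
            and event["time"] - last["last_time"] <= window
        ):
            last["count"] += 1
            last["last_time"] = event["time"]
        else:
            # Copy the event so the caller's data is left unmodified.
            collapsed.append({**event, "count": 1, "last_time": event["time"]})
    return collapsed
```

Whether a run of duplicates is better represented as one collapsed event (with a count) or as an interval (with a duration) likely depends on whether the analysis cares about how many times something happened or how long it lasted; the sketch keeps both a `count` and the first and last timestamps so either reading remains possible.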

Not Doing

  • We are NOT competing with Anaconda.
  • We are NOT supporting multiple versions of Python.