Goals
- Allow for scalability in the analytics framework for Apache SensSoft
- Distill 0.2.0 will allow us to grow the incumbent analytical/modeling capability of Distill, including:
- Pre-packaged preprocessing methods for filtering, sequencing, and packaging time-series event data, with metadata, as portable Python dictionaries
- Pre-packaged graph and time-series modeling methods
- Limited packaging of statistical and data processing Python packages (e.g., NumPy, SciPy, Pandas, etc.)
- Allow for customizable, user-generated Python content within Distill
- Distill 0.2.0 will allow users to generate their own libraries for Distill that implement pre-packaged functions as well as user-generated ones.
- Distill 0.2.0 will enforce a predictable library structure, making the process less error-prone and reducing the support burden for Distill
- Allow for portability of processed user log data to different environments (e.g., visualization tools or other analytic environments such as Anaconda)
- Distill 0.2.0 will generate a predictable output structure, making it easier to develop readers for different environments
- Distill 0.2.0 will generate predictable outputs that can be consumed by other services and applications
- Distill 0.2.0 output will be parsable within other Python batch scripts or IDEs
- Allow for reduced-to-practice methods that make User Activity data from Software Environments useful for a variety of use cases:
- Automation: other processes and applications are able to consume processed Distill data to drive logic or analytics-based processes
- Example: A web application changes the rate at which messages are pushed to users based on which application features users use and how they use them.
- Example: Distill is regularly called to generate updated model-derived features for secondary processing, such as Machine Learning.
- Offline Analytics: Data scientists, analysts, and statisticians of various levels of technical depth are able to quickly process user activity data and draw Distill output into their analytics pipeline.
- Visual Analytics: Interactive data visualizations (e.g., TAP) can use Distill to configure interactive features (menus, toggles, drop-downs) that interact with user activity data and extract the model parameters used for rendering results in the visualization.
Background and strategic fit
- UserALE data is structurally rich, and while manageable, requires technical depth to process beyond simple aggregation
- Distill 0.2.0 will make UserALE data usable by a wider range of professionals and academics with canonical offerings that can reduce pre-processing workload substantially for novices.
- Distill 0.2.0 will make UserALE data useful for a wider range of use cases that depend on deep expertise in working with time-series, nested, multi-dimensional, categorical, and semantic data.
- Distill 0.2.0 will provide the Apache SensSoft community with more discrete, bounded programming projects to add to the code base, with more immediate value to contributors.
- Distill 0.2.0 will provide Apache SensSoft with substantial horizontal growth opportunities in order to grow the community base.
Assumptions
- Distill will act as an abstraction layer around Elasticsearch, and will not provide direct access to the Elasticsearch API
- When consumed as a Python library, Distill will return Python data objects (e.g., dictionaries)
- When consumed as a web service, Distill will return JSON objects
- Distill will never modify the source data (read operations only)
- Distill will not provide any storage or caching for transformed data
- Distill will only support Python 3.5+
- Distill will only support x64 architecture
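To make the two return-type assumptions concrete, here is a minimal sketch of one result surfaced both ways. The `summarize` and `summarize_as_json` functions and their output keys are hypothetical stand-ins for any Distill operation, not part of Distill's API; only the standard-library `json` module is used.

```python
import json


def summarize(records):
    """Hypothetical library call: returns a plain Python dictionary.

    The input list is only read, never modified.
    """
    return {
        "count": len(records),
        "types": sorted({r["type"] for r in records}),
    }


def summarize_as_json(records):
    """Hypothetical web-service wrapper: same result, serialized as JSON text."""
    return json.dumps(summarize(records))


# Made-up records for illustration only.
records = [{"type": "click"}, {"type": "keydown"}, {"type": "click"}]

as_dict = summarize(records)          # what a Python caller would receive
as_json = summarize_as_json(records)  # what a web-service client would receive
```

Keeping the web-service path as a thin serialization layer over the library path is one way to ensure both consumers see the same structure.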
Requirements
# | Title | User Story | Importance | Notes |
---|---|---|---|---|
1 | Accommodate arbitrary data schemas (not just UserALE) | Allows us to expand to cover arbitrarily many data sources without needing to write new methods for Distill | SHOULD HAVE | |
2 | Distill is primarily designed to support UserALE.js | | MUST HAVE | |
2 | Distill will support JSON datastores for datasource/query | | MUST HAVE | |
3 | Installable as a Python library (via pip install) | | MUST HAVE | |
4 | Consumable as a RESTful web service | | MUST HAVE | |
5 | Provide a set of built-in data queries for common use cases | | MUST HAVE | |
6 | Allow users to make custom queries not covered by the built-in queries | | MUST HAVE | |
7 | Provide a set of built-in data transformations for common use cases | | MUST HAVE | |
8 | Thoroughly document the API using Sphinx so that users can extend Distill functionality to custom data transformations, analytics, schemas, etc. | | MUST HAVE | |
9 | Support Windows, Linux, and Mac users with OS-specific eggs/wheels | | MUST HAVE | |
10 | End-to-end encryption | | MUST HAVE | |
11 | Provide convenience libraries with pre-built data schemas for UserALE and other data streams | | SHOULD HAVE | |
Questions
Below is a list of questions to be addressed as a result of this requirements document:
Question | Outcome |
---|---|
| See requirement #1 |
2. Does Distill require a specific backend (Elastic), or can it go to Solr/Lucene? | Underlying data store needs to support key-value pairs |
3. How do we support Windows Users? | See requirement #9 |
4. How do we provide the "average" data scientist with enough out-of-the-box packages and modules to be minimally viable? | |
5. Roadmap for supporting packages and Anaconda distribution | |
6. Migrate to Django from Flask? | |
7. Is Distill simple Python, or does it run as a service (or on a web service) by design? | Both, see requirements #3 & #4 |
8. Does Distill manage scale in its connections to other datastores, or does it rely solely on Lucene-based services (Elastic)? | Distill's querying is dependent on how well Elasticsearch scales on query. |
9. Does Distill remain tethered outright to Elastic? | See requirement #2 |
10. TLS or SSL: Modern vs. Legacy network support. | |
Analytics & Processing Examples
These examples are here for drawing out higher-level goals for Distill's functionality. This section can be removed once the goals have been solidified.
Here is a model data pipeline for Apache SensSoft: RAW DATA>QUERY>FILTER/Q&A>TRANSFORMATION>PRIMITIVE FEATURE EXTRACTION>TRAINED MODELING>DERIVED FEATURE EXTRACTION
There are a few different classes of libraries that Distill might include in support of this pipeline; they have different consequences for workflows within larger analytic pipelines.
- QUERY: We may want to be able to recreate previous queries used for other analyses, not necessarily "save" queries.
- FILTERING/Q&A: Elimination of data from query return, when that data can't be eliminated by query alone because some pattern to be filtered is fully nested within some query index.
- EX: Filter out specific save events from osquery object access data that do not coincide with click/keyboard activity with KM Logger.
- EX: Random resampling of the KM Logger event time series, taking a random sample at every 1-minute interval
- TRANSFORMATION: Native format of Lucene-like DataStores is a list of records, called as JSON through queries
- EX: Query data from one or more data sources (bunch of JSON), impose structure on JSON so that we represent as list object of logs ordered by timestamp (TS)
- PRIMITIVE FEATURE EXTRACTION
- EX: Query for UserALE.js data and type=="click", then by userId, then aggregate across some time interval (e.g., count) by unique userId (within-user), return
- EX: Query for UserALE.js data and type=="click", then by userId, then aggregate over logs by unique userId (e.g., count, mean, median, mode, variance, range) (between-user)
- EX: Using count data (EX a), bin by "path", create probability of clicking on X path.
- TRAINED MODELING
- EX: Call or recreate PRIMITIVE FEATURES, then feed features to Graph Methods, NN or HMM, etc. (see this paper), return model params to Python Env.
- EX: Build a simple directed graph (like bowie http://senssoft.incubator.apache.org/) that shows stochastic relationships between elements, or pages, return model params to Python Env. as well as in/out degree, centrality metrics.
- DERIVED FEATURE EXTRACTION
- EX: Call or recreate the TRAINED MODELING step (5), extract model features as in PRIMITIVE FEATURE EXTRACTION (4), return to Python Env.
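As a rough illustration of the TRANSFORMATION and PRIMITIVE FEATURE EXTRACTION examples above, the sketch below orders a handful of made-up logs by timestamp, counts clicks per user, and turns per-path click counts into probabilities. The field names (`userId`, `type`, `path`, `clientTime`), the string-valued `path`, and the helper function names are assumptions for illustration; they are not Distill's API and may not match the UserALE.js schema exactly.

```python
from collections import Counter

# Hypothetical sample logs; field names and values are assumed.
logs = [
    {"userId": "u1", "type": "click", "path": "button#save", "clientTime": 3},
    {"userId": "u2", "type": "click", "path": "a#home", "clientTime": 1},
    {"userId": "u1", "type": "keydown", "path": "input#name", "clientTime": 2},
    {"userId": "u1", "type": "click", "path": "a#home", "clientTime": 4},
]


def order_by_timestamp(records, key="clientTime"):
    """TRANSFORMATION: return a new list ordered by timestamp (input untouched)."""
    return sorted(records, key=lambda r: r[key])


def clicks_per_user(records):
    """PRIMITIVE FEATURE: number of click events for each unique userId."""
    return Counter(r["userId"] for r in records if r["type"] == "click")


def path_click_probability(records):
    """PRIMITIVE FEATURE: share of all clicks that landed on each path."""
    counts = Counter(r["path"] for r in records if r["type"] == "click")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {path: n / total for path, n in counts.items()}


ordered = order_by_timestamp(logs)
user_counts = clicks_per_user(ordered)
path_probs = path_click_probability(ordered)
```

Each helper returns plain Python objects (a list, a `Counter`, a dictionary), in line with the assumption that the library returns Python data objects and never modifies the source data.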
- Build intervals from matching sequences of raw events
- Filter out unwanted events
- Noisy/irrelevant events
- May be conditional on neighboring events
- "dangling" events (e.g. a stop event with no corresponding start)
- Collapse duplicate events into a single event (when is this preferable to creating an interval?)
- Create "sandwiches" (a set of events bookended by, e.g., a related start and stop event)
- Replace some logs/data with other logs/data
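The transformations listed above (building intervals, dropping "dangling" events, collapsing duplicates) can be sketched in plain Python as follows. The event shape (`type`, `name`, `ts`), the `start`/`stop` type values, and the choice to treat only consecutive repeats as duplicates are all assumptions made for this example; the scripts the list was drawn from are not reproduced here.

```python
def collapse_duplicates(events):
    """Collapse runs of consecutive events sharing (type, name) into the first one."""
    collapsed = []
    for event in events:
        if collapsed:
            last = collapsed[-1]
            if (last["type"], last["name"]) == (event["type"], event["name"]):
                continue  # skip the repeat, keep the earlier event
        collapsed.append(event)
    return collapsed


def build_intervals(events):
    """Pair each start event with the next stop event of the same name.

    Returns (intervals, dangling). Dangling events are stops with no open
    start, starts replaced by a later start, and starts never closed.
    """
    open_starts = {}
    intervals, dangling = [], []
    for event in sorted(events, key=lambda e: e["ts"]):
        name = event["name"]
        if event["type"] == "start":
            if name in open_starts:
                dangling.append(open_starts[name])
            open_starts[name] = event
        elif event["type"] == "stop":
            start = open_starts.pop(name, None)
            if start is None:
                dangling.append(event)
            else:
                intervals.append(
                    {"name": name, "start": start["ts"], "stop": event["ts"]}
                )
    dangling.extend(open_starts.values())
    return intervals, dangling


# Made-up events for illustration only.
events = [
    {"type": "stop", "name": "scroll", "ts": 1},   # dangling: no start
    {"type": "start", "name": "drag", "ts": 2},
    {"type": "start", "name": "drag", "ts": 3},    # duplicate of previous
    {"type": "stop", "name": "drag", "ts": 5},
    {"type": "start", "name": "hover", "ts": 6},   # dangling: never stopped
]

intervals, dangling = build_intervals(collapse_duplicates(events))
```

Both functions build new lists in memory and leave the input events unchanged, which matches the read-only assumption for the source data.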
Not Doing
- We are NOT competing with Anaconda.
- We are NOT supporting multiple versions of Python.
27 Comments
Joshua C. Poore
Todd Nelling: We want people to be able to pull returns from Distill from things like PyCharm, etc. Not just from server-side batch scripts. Scripts is repeated. Fixed
Joshua C. Poore
Todd Nelling: No. Elastic/Kibana already provides a nice tool for visualizing streaming data. A server-side script can print values in near-real time back to Elastic records, which can then be visualized in Kibana. Distill polls from data that is already-collected as you say, for use in exploratory visual analytics v. real-time visualization.
Michelle Beard
I don't understand the motivation behind the IDE support?
Michelle Beard
Just UserALE, independent of what generated it.
Michelle Beard
i.e. Elasticsearch can be treated as read-only datastore.
Joshua C. Poore
Good. Like that assumption.
Michelle Beard
I suggest Python 3.5 and above.
Michelle Beard
This is pretty vague. Can someone describe the motivation behind this requirement (aka fill out the user story).
Joshua C. Poore
allow us to expand to cover arbitrarily many data sources without needing to write new methods for Distill
we would just need to create new configuration blocks
Michelle Beard
I suggest a main requirement. The datasource must be a document store.
Michelle Beard
Wrong term. Installable as a pip package.
Michelle Beard
Use sphinx.
Michelle Beard
x64 architecture. Must have eggs that are dependent on OS.
Michelle Beard
SSL or TLS?
Michelle Beard
And this should be a MUST HAVE.
Michelle Beard
Distill's querying is dependent on how well Elasticsearch scales on query.
Michelle Beard
You actually need to interview potential users.
Michelle Beard
Easier to do it in UserALE.
Todd Nelling
It is done in UserALE, but my understanding is that there are higher-level intervals that can be drawn out that UserALE itself can't necessarily identify at runtime.
Joshua C. Poore
Out of scope. That would be a simple aggregation across a brick of other intervals. Let's just call that simple aggregation, which is really re-sampling in this case.
Michelle Beard
Don't see how this is different from intervals?
Todd Nelling
Hence the parenthetical question. This list was pulled from a set of scripts used to transform UserALE data, and I simply made note of what I found.
Joshua C. Poore
This is resampling with conditional logic.
Michelle Beard
I was under the impression that Distill cannot write back to Elasticsearch?
Todd Nelling
The scripts I saw didn't write the replaced data back to Elasticsearch, they simply replaced it in-memory.
Joshua C. Poore
I agree with read-only. But I don't necessarily get the in-memory bit.
Todd Nelling
In-memory just means that the results weren't being written back to Elastic. I don't know where the output of the scripts ended up.