IoTDB as Time Series Storage

Hi all,

I'm currently working on supporting Apache IoTDB as an additional option to store our time series data in StreamPipes.
My experience with IoTDB is limited to reading the documentation, performing some small evaluations, and the implementation effort I've spent so far.
Thus, the following is based on my current understanding of IoTDB's concepts and may therefore be erroneous or incomplete.

## What is already possible

As of now, we have very rudimentary support for Apache IoTDB as time series storage in our development branch.
IoTDB can be selected as the time series storage by setting a set of environment variables.
As a result, all time series data persisted in StreamPipes is stored in IoTDB.

Under the hood, this works as follows:

Let's assume we have created a data stream in StreamPipes with the help of the [machine data simulator](https://streampipes.apache.org/docs/pe/org.apache.streampipes.connect.iiot.adapters.simulator.machine/)
and persist its data in StreamPipes in a measure called `flowrate`.
StreamPipes now has a so-called `DataLakeMeasure` with the name `flowrate` and the following schema:
* `timestamp`: Long
* `sensorId`: String
* `mass_flow`: Float
* `volume_flow`: Float
* `temperature`: Float
* `density`: Float
* `sensor_fault_flags`: Boolean

Every event of our data stream is written into aligned time series of one device.
The device name in our scenario equals the name of the data lake measure (`flowrate`) and the following path is used within IoTDB: `root.streampipes.flowrate`.
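
To make this mapping more tangible, here is a minimal Python sketch of how the `flowrate` schema could be translated into an IoTDB aligned time series definition. This is not the actual StreamPipes implementation: the type mapping (`Long` to `INT64`, `String` to `TEXT`), the helper name `create_aligned_statement`, and the assumption that `timestamp` becomes IoTDB's time column instead of a measurement are mine, and the statement syntax follows my reading of the IoTDB documentation, so please verify it against the version in use.

```python
# Assumed mapping from the schema's types to IoTDB data types (illustrative only).
TYPE_MAP = {"Long": "INT64", "String": "TEXT", "Float": "FLOAT", "Boolean": "BOOLEAN"}

FLOWRATE_SCHEMA = {
    "timestamp": "Long",
    "sensorId": "String",
    "mass_flow": "Float",
    "volume_flow": "Float",
    "temperature": "Float",
    "density": "Float",
    "sensor_fault_flags": "Boolean",
}


def create_aligned_statement(measure, schema, time_field="timestamp"):
    """Build a CREATE ALIGNED TIMESERIES statement for one data lake measure.

    The time field is skipped because it is assumed to map to IoTDB's time column.
    """
    device = f"root.streampipes.{measure}"
    columns = ", ".join(
        f"{name} {TYPE_MAP[runtime_type]}"
        for name, runtime_type in schema.items()
        if name != time_field
    )
    return f"CREATE ALIGNED TIMESERIES {device}({columns})"


statement = create_aligned_statement("flowrate", FLOWRATE_SCHEMA)
print(statement)
```

Note that `sensorId` ends up as an ordinary `TEXT` measurement of the single device `root.streampipes.flowrate` in this layout.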

This is what the data looks like:

```sql
select * from root.streampipes.flowrate limit 10 align by device
```

In addition, the count mechanism that gives an overview of how many events exist per data lake measure is implemented as well, using the following query:

```sql
select count(temperature) from root.streampipes.flowrate
```

Well, so far so good. Getting here was straightforward and IoTDB is easy to work with; it's a really cool piece of software!

## Where problems arise

The approach outlined above is straightforward and works well with StreamPipes.
However, there are two aspects we haven't considered yet that make things more challenging, and it is currently unclear how to support them:
* Dimension properties
* Complex data types (mainly lists and nested data)

The latter is not directly relevant, but if anyone has a viable suggestion, I'd really appreciate it.
So let's take a closer look at dimension properties.
In StreamPipes, dimension properties are event properties (event fields) that represent dimensions.
In the example given above, `sensorId` would typically be modeled as a dimension property.
As such, `sensorId` has a discrete value space containing values like `flowrate01` and `flowrate02`.
Dimension properties allow users, for example, to group data along the provided dimensions in the data explorer.

There could be more than one dimension property, such as `locationId` and `sensorId`, and users can potentially change an existing adapter to add or remove a dimension.

If our `flowrate` data stream is persisted as above, there is, to the best of my knowledge, no way to group data based on `sensorId`,
e.g., to calculate the average temperature per `sensorId`. Or is there anything I'm missing?

One possible solution is to split the data of our data stream into multiple devices in IoTDB and work with tags.
The idea is to have one device per value of a dimension property, in this case `flowrate.flowrate01` and `flowrate.flowrate02`.
This allows us to tag the corresponding time series, e.g., `flowrate.flowrate01.temperature`, with a corresponding tag: `sensorId=flowrate01`.
This gives us the desired capability of grouping values along dimensions:

```sql
select avg(temperature) from root.streampipes.flowrate.** GROUP BY TAGS(sensorId)
```

Please excuse the different tag name and values in this screenshot, but I think you get the point.
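
As a rough illustration of this layout, the following Python sketch derives the device path for a dimension value and builds the statement that attaches the tag. The helper names `device_path` and `add_tags_statement` are hypothetical, and the `ALTER TIMESERIES ... ADD TAGS` syntax reflects my understanding of the IoTDB documentation rather than tested behavior in StreamPipes.

```python
def device_path(measure, dimension_values):
    """One IoTDB device per combination of dimension values (hypothetical layout)."""
    return ".".join(["root", "streampipes", measure, *dimension_values])


def add_tags_statement(measure, field, dimensions):
    """Build an ALTER TIMESERIES ... ADD TAGS statement for one time series.

    `dimensions` maps dimension property names to values,
    e.g. {"sensorId": "flowrate01"}.
    """
    series = f"{device_path(measure, dimensions.values())}.{field}"
    tags = ", ".join(f"{name}={value}" for name, value in dimensions.items())
    return f"ALTER TIMESERIES {series} ADD TAGS {tags}"


for sensor in ("flowrate01", "flowrate02"):
    print(add_tags_statement("flowrate", "temperature", {"sensorId": sensor}))
```

The sketch assumes that dimension values are valid IoTDB path node names; values containing dots or other special characters would need escaping or sanitizing first.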

As an alternative, we could also do without tags and use the `GROUP BY LEVEL` clause instead.
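
A minimal sketch of what such a query could look like, assuming the device-per-dimension-value layout with paths like `root.streampipes.flowrate.flowrate01.temperature`. As far as I understand, IoTDB counts path levels from `root` as level 0, so the first dimension value would sit at level 3; the helper name `group_by_level_query` is hypothetical.

```python
def group_by_level_query(measure, field, dimension_index=0, aggregation="avg"):
    """Build a GROUP BY LEVEL query that aggregates per dimension value.

    Assumed path levels: root = 0, streampipes = 1, <measure> = 2,
    first dimension value = 3, second dimension value = 4, and so on.
    """
    level = 3 + dimension_index
    return (
        f"select {aggregation}({field}) "
        f"from root.streampipes.{measure}.** group by level = {level}"
    )


query = group_by_level_query("flowrate", "temperature")
print(query)
```

Unlike tags, this variant couples the query to the position of the dimension in the path, so adding or reordering dimensions would change the level to group by.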

The upside is that this allows us to calculate aggregations per dimension value, but it also has some downsides.
First of all, it would break the current relationship in which one data lake measure corresponds to one measurement in the time series storage.
Modeling the data as described here could result in a huge number of time series, since there must be one for each combination of dimension property values.
In addition, it would make our queries more complex, since we could no longer directly query all data for a data lake measure, or count all records of a data lake measure, without further computation.
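
To get a feeling for these downsides, here is a small Python sketch that enumerates the devices for two dimensions and shows the extra step needed to count all records of a measure. The dimension values `plantA` and `plantB`, the per-device counts, and all helper names are made up for illustration.

```python
from itertools import product
from math import prod


def device_paths(measure, dimensions):
    """List one device path per combination of dimension values."""
    return [
        ".".join(["root", "streampipes", measure, *combo])
        for combo in product(*dimensions.values())
    ]


def series_count(dimensions, field_count):
    """Number of time series: combinations of dimension values times fields."""
    return prod(len(values) for values in dimensions.values()) * field_count


def total_events(per_device_counts):
    """Counting a whole measure now means summing the per-device counts."""
    return sum(per_device_counts.values())


dimensions = {
    "locationId": ["plantA", "plantB"],
    "sensorId": ["flowrate01", "flowrate02", "flowrate03"],
}
paths = device_paths("flowrate", dimensions)
# Five non-dimension fields: mass_flow, volume_flow, temperature, density, sensor_fault_flags
print(len(paths), series_count(dimensions, field_count=5))
print(total_events({path: 100 for path in paths}))
```

Even this small example with two locations and three sensors already yields 6 devices and 30 time series for a single data lake measure.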

What are your thoughts on these considerations?
Am I using the concepts in the right way?
Are there any alternatives with which we could achieve the same results?
What happens if an adapter is modified, e.g., a dimension field is removed?

## What should be possible in the end

Beyond the scenario above, we require the following functionality to be available via IoTDB queries. This may raise further considerations or issues and should ideally be taken into account when looking for a solution:

- simply list all values per data lake measure (`Select *`)
- filter data based on values of dimension properties
- group data based on a dimension property (e.g., show a line series per dimension property in the data explorer)
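
For discussion purposes, the three requirements could map to queries roughly as follows if we went with the device-per-dimension-value layout and tags. This is only a sketch under that assumption: the helper name `required_queries` is hypothetical, filtering by path only works if the dimension value is part of the device path, and I have not verified these queries against a running IoTDB instance.

```python
def required_queries(measure, dimension, value, field="temperature"):
    """Build one candidate query per requirement (assumed layout, untested)."""
    base = f"root.streampipes.{measure}"
    return {
        # list all values of a data lake measure across its devices
        "list_all": f"select * from {base}.** align by device",
        # filter by a dimension value by selecting the matching device
        "filter_by_dimension": f"select * from {base}.{value} align by device",
        # aggregate one field per value of a dimension property via its tag
        "group_by_dimension": f"select avg({field}) from {base}.** GROUP BY TAGS({dimension})",
    }


queries = required_queries("flowrate", "sensorId", "flowrate01")
for requirement, query in queries.items():
    print(f"{requirement}: {query}")
```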
