Description
Hi!
I have some experience working with time series (from medical sensors), and I was thinking of using TimeSeries.jl for my projects. For now I have a sort of review of this package, outlining choices that look strange to me (at least judging from the docs), along with proposals from my point of view. Maybe the authors will find it helpful.
I. AbstractTimeSeries
is absent from the docs. Is this some kind of common interface for different timeseries types? If so, the docs should include an example showing which methods I need to implement to support a custom timeseries type.
II. Heterogeneous series (tables) are dropped; from the docs:
"All the values inside the values array must be of the same type."
This is a huge limitation if one needs a timeseries with complex information, stored as a vector of structures or as a namedtuple of columns of different types (see StructArrays.jl).
Maybe there should be a separate TimeTable type with heterogeneous columns (similar to DataFrame), and a TimeArray for a single column type, sharing the same timestamps with the parent table?
More than that, individual columns can be a custom AbstractVector with some metadata for exotic element types. For example, if elements are encoded and metadata is needed to decode them on `getindex`:
- from point to physical unit, or
- from number to a category name (CategoricalArrays.jl), or
- from 8-bit element to 8 binary values (each encoded by bit position).
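To make the first case concrete, here is a minimal sketch of such a column, using only Base. The names `ScaledVector`, `gain` and `offset` are hypothetical and are not part of TimeSeries.jl; the point is only that a column can keep raw codes and decode them on `getindex`.

```julia
# Hypothetical wrapper: stores raw integer codes plus the metadata needed to decode them.
struct ScaledVector{T<:Integer} <: AbstractVector{Float64}
    raw::Vector{T}      # encoded samples, e.g. ADC points
    gain::Float64       # physical units per point (assumed metadata)
    offset::Float64     # physical value at raw == 0 (assumed metadata)
end

# Minimal AbstractVector interface: size and getindex.
Base.size(v::ScaledVector) = size(v.raw)
Base.IndexStyle(::Type{<:ScaledVector}) = IndexLinear()
Base.getindex(v::ScaledVector, i::Int) = v.raw[i] * v.gain + v.offset

ecg = ScaledVector(Int16[0, 100, -200], 0.25, 0.0)   # 0.25 units per point
ecg[2]          # decoded on access: 25.0
collect(ecg)    # materialized as a plain Vector{Float64}
```

Because the wrapper is an ordinary AbstractVector, generic functions such as `sum` or `maximum` work on it without knowing about the encoding.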
III. There is no separate implementation for timeseries with a regular sample rate that can be constrained to operations that produce uniform sampling (similar to SampledSignals.jl). This type does not need to store a materialized timestamps vector at all, since time can be calculated from `index`, `startdate` and `samplerate` (I call this a "time grid", which provides an `index2time` and `time2index` pair of functions). The timeseries remains uniform unless you take irregular / arbitrary samples from it - the result is then converted to a common (non-uniform) timeseries with a timestamps vector in it.
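A rough sketch of what I mean by a time grid, using only the Dates standard library. `TimeGrid` and its field names are hypothetical, and I store the sample period instead of the sample rate to keep the arithmetic in integers; rounding down in `time2index` is just one possible convention.

```julia
using Dates

# Hypothetical "time grid": start time plus a fixed sample period.
struct TimeGrid
    start::DateTime
    period::Millisecond   # time between adjacent samples (1 / samplerate)
end

# Index -> timestamp (1-based indexing).
index2time(g::TimeGrid, i::Integer) = g.start + (i - 1) * g.period

# Timestamp -> index of the latest sample at or before t.
time2index(g::TimeGrid, t::DateTime) =
    fld(Dates.value(t - g.start), Dates.value(g.period)) + 1

grid = TimeGrid(DateTime(2020, 12, 1), Millisecond(4))   # 250 Hz
index2time(grid, 251)                                    # one second after start
time2index(grid, DateTime(2020, 12, 1, 0, 0, 1))         # 251
```

With such a grid, slicing by a time interval reduces to computing two indices, and no timestamps vector is ever allocated.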
IV. There are no timeseries with several timestamp columns. In my practice, I always have three different timeseries types:
- series - regularly sampled timeseries
- events - irregularly sampled timeseries with one timestamp value
- segments - irregularly sampled timeseries with two timestamp values (start, stop) for elements that have some extent in time.
There are several special cases for segments with regard to indexing (what to do if I request a time point inside a segment, or a time interval that partially overlaps with segments at its edges).
Maybe there can be even more exotic (or common) timeseries with more than two timestamps (each row is itself a repetition of some complex process in time with many "phases"), where you should explicitly choose which timestamp column you want to index by. But I would not complicate it that far.
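The special cases for segments mentioned above can be sketched as three separate queries. This is only an illustration with hypothetical names (`Segment`, `containing`, `overlapping`, `within`), assuming segments are closed on both ends; a real implementation would probably exploit sorted start times instead of filtering.

```julia
using Dates

# Hypothetical segment: an element with extent in time, closed on both ends.
struct Segment
    start::DateTime
    stop::DateTime
    label::String
end

# Segments that contain the time point t.
containing(segs, t::DateTime) = filter(s -> s.start <= t <= s.stop, segs)

# Segments that overlap [a, b] at least partially.
overlapping(segs, a::DateTime, b::DateTime) =
    filter(s -> s.start <= b && s.stop >= a, segs)

# Segments that lie entirely inside [a, b].
within(segs, a::DateTime, b::DateTime) =
    filter(s -> a <= s.start && s.stop <= b, segs)

d(n) = DateTime(2020, 12, n)   # shorthand for the example data
segs = [Segment(d(1), d(3), "sleep"),
        Segment(d(5), d(8), "exercise"),
        Segment(d(10), d(12), "rest")]

containing(segs, d(2))          # the "sleep" segment
overlapping(segs, d(2), d(6))   # "sleep" and "exercise"
within(segs, d(2), d(9))        # only "exercise"
```

Which of `overlapping` or `within` should be the default for interval indexing is exactly the design question; a third option would be to clip the edge segments to the requested interval.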
V. Row indexing. You can index rows by:
- single integer,
- integer range, optionally with a step
- single Date
- multiple Dates
- range of Date with step
What is missing:
- Multiple integers (at least from docs).
- Intervals of Date with no step (or step = 0, it should return all values within two boundaries, regardless of time step).
- Combined time and integer index. Sometimes you want to get, say, the next ten elements relative to some time point: "get 10 elements after 2020-12-01", or even "get 5 elements before and 5 elements after 2020-12-01", or "get all elements between 2020-12-01 and 2020-12-31, including 1 previous and 1 next element" - so there can be a combination of two relative indexes AND a time point / interval. In fact, indexing only by an integer value is a special case where your time point equals the first element's timestamp. (I use a `getindex` method with both `time` and `index` positional arguments for different combinations.)
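A possible sketch of the combined time-and-integer lookup, assuming a sorted timestamps vector. The helper name `timeslice` and its `before` / `after` keywords are hypothetical (my own code uses positional arguments on `getindex`); it returns an integer range that can then be used for ordinary row indexing.

```julia
using Dates

# Hypothetical helper: rows from `before` elements ahead of the first timestamp >= t1
# up to `after` elements past the last timestamp <= t2, clamped to the valid range.
# Assumes `ts` is sorted in ascending order.
function timeslice(ts::AbstractVector{<:TimeType}, t1, t2 = t1;
                   before::Integer = 0, after::Integer = 0)
    lo = searchsortedfirst(ts, t1)   # first index with ts[i] >= t1
    hi = searchsortedlast(ts, t2)    # last index with ts[i] <= t2
    return max(lo - before, firstindex(ts)):min(hi + after, lastindex(ts))
end

ts = [Date(2020, 12, day) for day in (1, 2, 5, 9, 10, 20, 25, 31)]

timeslice(ts, Date(2020, 12, 2), Date(2020, 12, 10))                         # 2:5
timeslice(ts, Date(2020, 12, 5), Date(2020, 12, 20); before = 1, after = 1)  # 2:7
timeslice(ts, Date(2020, 12, 15); before = 2, after = 2)                     # 4:7
```

The first call is the "interval with no step" case, the second adds one previous and one next element, and the third takes two elements on each side of a time point that is not itself present in the series.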
VI. The "Splitting by condition" section has two different sets of functions:
- special cases of `where` in tables, but for timeseries (`when`, `findwhen`, `findall`),
- other functions that duplicate indexing (`from`, `to`).
VII. Maybe there should be some convention between functions that take and return timeseries, and functions that return standard vector types:
- `findwhen` vs `findall`;
- select a column as a simple vector, or as a timearray with a single column;
- logical operators returning a timearray of Bool, or a bitvector.
Also, there could be methods to toggle between the timeseries type and the underlying table type, or a standard array / vector of tuples. This is similar to Tables.columntable (from Tables.jl), which is used with DataFrames to toggle between type-stable and compile-friendly cases.
VIII. Operation on single columns - or whole timeseries
- Dot-wise operations are only calculated on values that share a timestamp. This is a very tricky part, because there is an implicit inner join, and all columns should be of the same numeric type. So maybe they should be applied only to a single column, or a single column could be modified this way in place? This is also about heterogeneity, as in section II above.
- For Apply methods there are many functions from standard or third-party packages (DSP, RollingFunctions, etc.) that work on simple Vector types, so there is no need to rewrite them all if you can provide one common syntax to "wrap" any function on an individual column. Then you can replace `diff`, `percentchange`, `moving`, `upto` with similar functions from any other package.
- Also, `basecall` looks strange to me - what if I want to run a function that is not from Base, and run it on a single column or on a set of selected columns?
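The "wrap any function on a column" idea could look roughly like this. I use a plain NamedTuple of vectors as a stand-in for the table, and `applycols` is a hypothetical name, not an existing TimeSeries.jl function.

```julia
# Hypothetical helper: apply any vector function `f` to the selected columns of a
# column table (a NamedTuple of vectors) and return the transformed columns.
applycols(f, table::NamedTuple, cols::Symbol...) =
    NamedTuple{cols}(map(c -> f(getproperty(table, c)), cols))

table = (open = [1.0, 2.0, 4.0], close = [1.5, 2.5, 3.0], flag = [true, false, true])

applycols(diff, table, :open)                  # (open = [1.0, 2.0],)
applycols(cumsum, table, :open, :close)        # two numeric columns, flag untouched
applycols(x -> x ./ first(x), table, :close)   # any user-defined function works too
```

Note that functions like `diff` shorten the column, so a real wrapper would also have to decide how to realign the result with the timestamps (drop the first timestamp, pad with missing, etc.).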
IX. Combine methods
- Why the `merge` naming instead of the more common `join`?
- `collapse`: AFAIK this is called `decimation` or `resampling` with another samplerate or time intervals - not only `day`, `week`, etc. Maybe even a vector of custom intervals. And it should accept any arbitrary function that can `reduce` all elements that fall within each time interval (for example, you can get a time distribution if you count the number of elements over fixed time intervals).
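As an illustration of reducing over arbitrary fixed intervals, here is a small sketch on plain vectors with the Dates standard library. `resample` is a hypothetical name; bins are aligned with `floor`, and empty bins are simply skipped, which may or may not be what you want for a distribution.

```julia
using Dates

# Hypothetical resampling helper: group samples into bins of width `step`
# (aligned with floor) and reduce each non-empty bin with `f`.
function resample(f, ts::AbstractVector{<:TimeType}, vals::AbstractVector, step::Period)
    bins = [floor(t, step) for t in ts]   # bin start for every sample
    starts = unique(bins)                 # in time order if ts is sorted
    return [s => f([v for (b, v) in zip(bins, vals) if b == s]) for s in starts]
end

t0 = DateTime(2020, 12, 1)
ts = [t0 + Minute(m) for m in (0, 10, 50, 70, 130, 135)]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

resample(length, ts, vals, Hour(1))   # counts per hour: 3, 1, 2
resample(sum, ts, vals, Hour(1))      # sums per hour: 6.0, 4.0, 11.0
```

The same call works with `Minute(15)`, `Day(1)` or any other period, and with any reducing function from another package.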
X. Customize TimeArray printing
Can I choose a time string format to show, or is it chosen automatically based on - what? It would be nice to have examples for high-frequency timestamps in units of milliseconds.
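For reference, this is the kind of output I would like to be able to choose. The snippet uses only the Dates standard library; whether TimeArray printing can be given such a format is exactly my question, so nothing here is a TimeSeries.jl API.

```julia
using Dates

t = DateTime(2020, 12, 1, 8, 30, 15, 250)

Dates.format(t, dateformat"yyyy-mm-dd HH:MM:SS.sss")   # "2020-12-01 08:30:15.250"
Dates.format(t, dateformat"HH:MM:SS.sss")              # "08:30:15.250"
```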