Modeling Irregular Time Series with Continuous Recurrent Units
Mona Schirmer
Mazin Eltayeb
Stefan Lessmann
Maja Rudolph
Abstract
Recurrent neural networks (RNNs) like long short-term memory networks (LSTMs) and gated recurrent units (GRUs) are a popular choice for modeling sequential data. Their gating mechanism permits weighting previous history encoded in a hidden state against new information from incoming observations. In many applications, such as medical records, observation times are irregular and carry important information. However, LSTMs and GRUs assume constant time intervals between observations. To address this challenge, we propose continuous recurrent units (CRUs), a neural architecture that can naturally handle irregular time intervals between observations. The gating mechanism of the CRU employs the continuous formulation of a Kalman filter and alternates between (1) continuous latent state propagation according to a linear stochastic differential equation (SDE) and (2) latent state updates whenever a new observation comes in. In an empirical study, we show that the CRU can better interpolate irregular time series than neural ordinary differential equation (neural ODE)-based models. We also show that our model can infer dynamics from images and that the Kalman gain efficiently singles out candidates for valuable state updates from noisy observations.
1 Introduction
Recurrent architectures, such as the long short-term memory network (LSTM) [13] or the gated recurrent unit (GRU) [8], have become a staple machine learning tool for modeling time series data. Their modeling power comes from a hidden state, which is recursively updated to integrate new observations, and a gating mechanism that balances the importance of new information against the history already encoded in the latent state.
Recurrent neural networks (RNNs) typically assume regular sampling rates. However, many real-world data sets, such as electronic health records or climate data, are irregularly sampled. In irregularly sampled time series data, the time intervals between observations can vary in length. In health care, for instance, measurements of a patient's health status are only available when the patient goes to the doctor. The time elapsed between observations also carries information about the underlying time series. A lab test not administered for many months could mean that the patient was doing well in the meantime, while frequent visits might indicate that the patient's health is deteriorating. Standard RNNs have difficulty modeling such data as they do not reflect the continuity of the underlying temporal processes.
Recently, the work on neural ordinary differential equations (neural ODEs) [6] has established an elegant way of modeling irregularly sampled time series. Recurrent ordinary differential equation (ODE)-based architectures determine the hidden state between observations by an ODE and update it at observation times using standard RNN gating mechanisms (Rubanova et al. [26], Brouwer et al. [3], Lechner and Hasani [19]). These methods rely on some form of ODE solver, a network component that prolongs training time significantly (Rubanova et al. [26], Shukla and Marlin [27]).
In both standard and ODE-based RNNs, the hidden state is designed as a deterministic component rather than a random variable. A probabilistic state space, however, rests on theoretical grounds for optimal solutions and allows uncertainty to be tracked consistently across the transformation steps of the observations. Reliable uncertainty quantification is indispensable in decision-making contexts such as autonomous driving or AI-based medicine.
We propose the continuous recurrent unit (CRU), a probabilistic recurrent architecture for modeling irregularly sampled time series. An encoder maps observations into a latent space. In this latent space, the latent state of the CRU is governed by a linear stochastic differential equation (SDE). The analytic solution for propagating the latent state between observations and the update equations for integrating new observations are given by the continuous-discrete formulation of the Kalman filter [15]. Employing the linear SDE state space model and the Kalman filter has three major advantages. First, a probabilistic state space provides an explicit notion of uncertainty, both for an uncertainty-driven gating mechanism and for confidence evaluation of predictions. Second, the gating mechanism is optimal in a locally linear state space. Third, the latent state at any point in time can be resolved analytically, bypassing the need for numerical integration techniques. Our contributions are as follows:
We present the CRU, a time-series model that combines the power of neural networks for feature extraction with the advantages of a probabilistic state-space model, specifically the continuous-discrete Kalman filter (Section 3.1).
The resulting neural architecture can process time-series data like a recurrent neural network by sequentially processing observations and internally updating its latent states. Due to the properties of the continuous-discrete Kalman filter, the model deals with irregular time intervals between observations in a principled manner.
An efficient parameterization of the latent state transition via the transition's eigenspace (Section 3.3) allows us to reduce the model's complexity significantly and to trade off speed with accuracy.
In an empirical study (Section 4), we compare the performance of the CRU to discrete and continuous baselines on images, electronic health records, and climate data. We find that
(i) the gating mechanism leverages uncertainty arising from both noisy and partially observed inputs;
(ii) the CRU outperforms discrete RNN counterparts on synthetic image data;
(iii) our method can better interpolate irregular time series than neural ODE-based methods.
2 Related Work
Deep (probabilistic) sequence models
Classical deep recurrent time series models such as the GRU [8] and LSTM [13] distill information from incoming data and preserve relevant information in the cell state over a long time horizon. However, it has been argued that the lack of randomness in the internal transitions of such models fails to capture the variability of certain data types (Chung et al. [10]). Probabilistic alternatives infer relevant information in a latent variable model. Recent work has integrated the discrete Kalman filter as a probabilistic component into deep learning frameworks (Krishnan et al. [18], Karl et al. [16], Fraccaro et al. [11], Becker et al. [2]). Our method builds on the idea of recurrent Kalman networks (RKNs) [2].
RNNs for irregular time series
Applying standard RNNs to irregularly sampled time series necessarily requires first discretizing the timeline into equally spaced bins. However, this typically reduces the number of observation times, results in a loss of information, and creates the need for imputation and aggregation strategies.
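To make the information loss concrete, here is a minimal sketch (with made-up timestamps and values) of binning an irregularly sampled series onto a regular grid: observations falling into the same bin are aggregated away, and empty bins must later be imputed.

```python
import numpy as np

# Hypothetical irregularly sampled series: (timestamp, value) pairs.
times = np.array([0.1, 0.15, 0.9, 3.2, 3.3, 7.8])
values = np.array([1.0, 1.2, 0.8, 2.1, 2.0, 3.5])

# Discretize onto a regular grid of bin width 1.0, averaging within bins.
bins = np.floor(times).astype(int)      # bin index per observation
grid = np.full(8, np.nan)               # 8 equally spaced bins
for b in np.unique(bins):
    grid[b] = values[bins == b].mean()  # aggregation collapses observations

# Bins 1, 2, 4, 5, and 6 hold no data and now require imputation,
# while bins 0 and 3 merged several observations into a single value.
print(grid)
```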
Approaches that circumvent such preprocessing propose to augment the observations with timestamps (Choi et al. [9], Mozer et al. [24]) or observation masks [21].
However, such approaches do not capture the dynamics of the hidden state between observations. In Che et al. [5] and Cao et al. [4], the hidden state decays exponentially between observations according to a trainable decay parameter. The idea is to mirror the increasing irrelevance of encoded history as more time passes, an assumption that can be inappropriate in some use cases.
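A minimal sketch of this exponential-decay idea, with illustrative shapes and a hypothetical `decayed_state` helper (not the authors' code):

```python
import numpy as np

def decayed_state(h, dt, w):
    """Decay hidden state h toward zero over a gap dt with trainable rate w.

    Mirrors the exponential-decay idea of Che et al. [5]: the longer the
    gap since the last observation, the less the encoded history counts.
    """
    gamma = np.exp(-np.maximum(0.0, w) * dt)  # elementwise decay in (0, 1]
    return gamma * h

h = np.array([0.5, -0.3])        # hidden state at the last observation
w = np.array([1.0, 2.0])         # per-dimension decay rates
h_short = decayed_state(h, dt=0.1, w=w)  # short gap: mild decay
h_long = decayed_state(h, dt=5.0, w=w)   # long gap: state nearly forgotten
```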
Neural ODEs
Neural ODE-based methods explicitly model the dynamics of the hidden state by an ODE. Chen et al. [6] propose the latent ODE, a generative model whose latent state evolves according to a neural ODE. However, no update mechanism is employed, making it impossible to incorporate observations that arrive later into the latent trajectory. Kidger et al. [17] and Morrill et al. [23] extend neural ODEs with concepts from rough analysis, which allows for online learning. ODE-RNN [26] and ODE-LSTM [19] use standard RNN gates to sequentially update the hidden state at observation times. Both GRU-ODE-Bayes [3] and Neural Jump ODE (NJ-ODE) [12] enforce a Bayesian update by tightly coupling ODE dynamics and update step via their objective function. As Herrera et al. [12] point out, there is no theoretical guarantee that either GRU-ODE-Bayes or ODE-RNN yields optimal predictions. They prove convergence of their NJ-ODE model to the optimal prediction under Markovian assumptions. Note, however, that their framework does not model uncertainty in the gating mechanism.
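The ODE-RNN recipe of evolving the hidden state between observations and gating it at observation times can be sketched as follows; the one-layer dynamics net, the fixed-step Euler solver, and the plain tanh update cell are simplified stand-ins for the learned components and adaptive solvers of ODE-RNN [26]:

```python
import numpy as np

rng = np.random.default_rng(0)
Wh = rng.normal(size=(4, 4)) * 0.1  # stand-in for learned weights
Wx = rng.normal(size=(4, 2)) * 0.1

def f(h):
    # Learned ODE dynamics dh/dt = f(h); a one-layer net stands in here.
    return np.tanh(Wh @ h)

def ode_rnn_step(h, x, dt, n_euler=10):
    # 1) Evolve the hidden state over the gap dt with a fixed-step
    #    Euler solver (real implementations use adaptive solvers).
    for _ in range(n_euler):
        h = h + (dt / n_euler) * f(h)
    # 2) Update with the new observation via a simple RNN cell
    #    (standing in for the GRU gates of ODE-RNN).
    return np.tanh(Wh @ h + Wx @ x)

h = np.zeros(4)
for x, dt in [(np.ones(2), 0.3), (np.ones(2), 2.7)]:  # irregular gaps
    h = ode_rnn_step(h, x, dt)
```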
3 Method
The CRU addresses the challenge of modeling a time series $x_{\mathcal{T}} = [x_t \mid t \in \mathcal{T}]$ whose observation times $\mathcal{T} = \{t_0, t_1, \dots, t_N\}$ can occur at irregular intervals.
Figure 1 illustrates the network architecture.
An encoder $f_\theta$ and a decoder $g_\phi$ relate the observation space to a latent observation space and a latent state space. The CRU assumes Gaussian latent observations $y_t$ and a continuous latent state $z$ whose dynamics are governed by a linear SDE. In latent space, we alternate between inferring the latent state posterior at observation times and propagating the latent state continuously in time between observations.
At each time point $t \in \mathcal{T}$, an observation is propagated through four internal network steps:
A neural network encoder $f_\theta$ maps the observation $x_t$ to a latent observation space and outputs a transformed latent observation $y_t$ along with an elementwise latent observation noise $\sigma_t^{\text{obs}}$:
$$\text{encoder:}\quad [y_t, \sigma_t^{\text{obs}}] = f_\theta(x_t). \tag{1}$$
The posterior computation updates the latent state with the latent observation:
$$p(z_t \mid y_t) = \mathcal{N}(\mu_t^{+}, \Sigma_t^{+}). \tag{2}$$
Between observations, the latent state prior evolves according to the linear SDE
$$dz = A\,z\,dt + G\,d\beta, \tag{3}$$
where $A \in \mathbb{R}^{M \times M}$ is a time-invariant transition matrix, $\beta \in \mathbb{R}^{B}$ a Brownian motion process with diffusion matrix $Q \in \mathbb{R}^{B \times B}$, and $G \in \mathbb{R}^{M \times B}$ the diffusion coefficient.
The decoder maps the posterior estimate to the desired output space along with an elementwise uncertainty estimate:
$$\text{decoder:}\quad [o_t, \sigma_t^{\text{out}}] = g_\phi(\mu_t^{+}, \Sigma_t^{+}). \tag{4}$$
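Putting the four steps together, one pass over an irregular sequence might look like the following sketch; `encode`, `kalman_update`, `propagate`, and `decode` are hypothetical scalar-per-dimension stand-ins for $f_\theta$, the Kalman update, the closed-form SDE solution, and $g_\phi$, not the paper's implementation:

```python
import numpy as np

def encode(x):                                # step 1: x -> (y, obs noise)
    return x, 0.1 * np.ones_like(x)

def kalman_update(mu, Sigma, y, sigma_obs):   # step 2: posterior update
    K = Sigma / (Sigma + sigma_obs)           # per-dimension Kalman gain
    return mu + K * (y - mu), (1.0 - K) * Sigma

def propagate(mu, Sigma, dt, a=-0.5, q=0.2):  # step 3: evolve prior over gap
    phi = np.exp(a * dt)                      # exp(A dt) for diagonal A
    return phi * mu, phi**2 * Sigma + q * (phi**2 - 1.0) / (2.0 * a)

def decode(mu, Sigma):                        # step 4: output + uncertainty
    return mu, np.sqrt(Sigma)

mu, Sigma = np.zeros(2), np.ones(2)
outputs, t_prev = [], 0.0
for t, x in [(0.2, np.array([1.0, 0.5])), (1.7, np.array([0.8, 0.7]))]:
    mu, Sigma = propagate(mu, Sigma, t - t_prev)  # continuous propagation
    y, sigma_obs = encode(x)
    mu, Sigma = kalman_update(mu, Sigma, y, sigma_obs)
    outputs.append(decode(mu, Sigma))
    t_prev = t
```

Note how the gap length `t - t_prev` directly controls how much the prior mean shrinks and how much diffusion noise accumulates before the next update.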
The optimal solution to the latent state inference problem is given by the continuous-discrete Kalman filter. We first describe the continuous-discrete Kalman filter in the next section. Then we introduce the design choices of the CRU that ensure flexible yet fast state propagation.
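For a linear SDE of the form in Eq. (3), the prediction step of the continuous-discrete Kalman filter can be computed in closed form. The sketch below uses Van Loan's matrix-fraction trick to obtain both the transition matrix $\exp(A\,\Delta t)$ and the integrated diffusion covariance from a single matrix exponential (illustrative matrices; not the paper's parameterization):

```python
import numpy as np
from scipy.linalg import expm

def predict(mu, Sigma, A, G, Q, dt):
    """Closed-form moment propagation for dz = A z dt + G dbeta.

    Van Loan's trick: exponentiate the block matrix [[A, GQG^T], [0, -A^T]]
    to recover Phi = exp(A dt) and the integrated diffusion covariance
    Qd = int_0^dt exp(A s) G Q G^T exp(A^T s) ds without an ODE solver.
    """
    M = A.shape[0]
    GQG = G @ Q @ G.T
    F = np.block([[A, GQG], [np.zeros((M, M)), -A.T]])
    E = expm(F * dt)
    Phi = E[:M, :M]            # top-left block is exp(A dt)
    Qd = E[:M, M:] @ Phi.T     # top-right block times Phi^T gives Qd
    return Phi @ mu, Phi @ Sigma @ Phi.T + Qd

# Illustrative stable 2-D system.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
G = np.eye(2)
Q = 0.1 * np.eye(2)
mu, Sigma = predict(np.ones(2), np.eye(2), A, G, Q, dt=1.3)
```

Because the propagation is analytic in `dt`, the prior can be evolved over arbitrary irregular gaps at constant cost, in contrast to solver-based neural ODE approaches.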