wangjianlongnba / transdim

Data imputation for urban transportation system

transdim

Transportation data imputation (transdim).

Strategic aim

Creating accurate and efficient solutions for the spatio-temporal traffic data imputation and prediction tasks.

Tasks and challenges

  • Missing data imputation

    • Random missing: each sensor loses observations completely at random. (simple task)
    • Fiber missing: each sensor loses all observations for several whole days. (difficult task)
  • Rolling traffic prediction

    • Forecasting without missing values. (simple task)
    • Forecasting with incomplete observations. (difficult task)

What we are doing now!

  • add a framework indicating overall studies;

Framework: The tensor completion task, comprising data organization and tensor completion, where traffic measurements are only partially observed.

  • define the problems clearly;

    • Example: Traffic forecasting using matrix factorization models.

Real experiment setting: Observations with 0%, 20%, and 40% fiber missing rates during the first 56 days are treated as stationary inputs. The last 5 days (Monday to Friday) supply rolling inputs, and traffic speed over those days is forecast in a rolling manner.

  • describe the core challenges intuitively;
  • list main contributions of these studies.
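The real experiment setting above (56 stationary days followed by 5 rolling days) can be sketched as simple index arithmetic. This is only an illustration under stated assumptions: 144 time slots per day (as in the Guangzhou data set), a one-step forecast horizon, random placeholder data, and a naive last-value forecaster standing in for a real model. The helper name `rolling_splits` is hypothetical.

```python
import numpy as np

def rolling_splits(n_train_days, n_test_days, slots_per_day, horizon):
    """Yield (history_end, forecast_end) column indices for rolling forecasting.

    Everything before history_end is available to the model; the model must
    predict columns history_end .. forecast_end - 1.
    """
    start = n_train_days * slots_per_day
    end = (n_train_days + n_test_days) * slots_per_day
    for t in range(start, end, horizon):
        yield t, min(t + horizon, end)

# Placeholder speed matrix (4 road segments, 61 days); not real data.
rng = np.random.default_rng(0)
slots_per_day = 144                      # assumption: 144 time slots per day
speed = rng.uniform(20.0, 60.0, size=(4, 61 * slots_per_day))

offset = 56 * slots_per_day              # first column of the rolling period
forecast = np.zeros((4, 5 * slots_per_day))
splits = list(rolling_splits(56, 5, slots_per_day, horizon=1))

for hist_end, fc_end in splits:
    # Naive stand-in forecaster: repeat the last value seen in the history.
    last_seen = speed[:, hist_end - 1]
    forecast[:, hist_end - offset:fc_end - offset] = last_seen[:, None]
```

With a one-step horizon this produces one forecast per time slot of the last 5 days; a larger `horizon` would forecast several slots before new observations are rolled in.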

What we care about!

  • Best algebraic structure for data imputation.
  • The context of urban transportation (e.g., biases).
  • Data noise avoidance.
  • Competitive imputation and prediction performance.
  • Capability to handle various missing data scenarios.

Overview

With the development and application of intelligent transportation systems, large quantities of urban traffic data are collected continuously from sources such as loop detectors, cameras, and floating vehicles. These data sets capture the underlying states and dynamics of transportation networks and benefit many traffic operation and management applications, including routing, signal control, and travel time prediction. However, missing data are inevitable when collecting traffic data from intelligent transportation systems.

Publicly available at our Zenodo repository!

(a) Time series of actual and estimated speed over two weeks, from August 1 to 14.

(b) Time series of actual and estimated speed over two weeks, from September 12 to 25.

Imputation performance of BGCP (CP rank r=15, missing rate α=30%) under the fiber missing scenario with a third-order tensor representation, using the estimated result for road segment #1 as an example. In both panels, red rectangles mark fiber missing (i.e., speed observations are lost for a whole day).
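As a rough illustration of what a CP rank of r=15 means for a third-order tensor, the sketch below rebuilds a (214, 61, 144) tensor from three factor matrices, where each entry is the sum over r components of the product of one factor value per mode. The factor matrices here are random placeholders (an assumption for illustration); BGCP would infer them from the partially observed data.

```python
import numpy as np

rank = 15                                  # CP rank r quoted in the caption
n_segments, n_days, n_slots = 214, 61, 144

# Random placeholder factors; a real model would learn these from data.
rng = np.random.default_rng(0)
U = rng.normal(size=(n_segments, rank))    # road segment factors
V = rng.normal(size=(n_days, rank))        # day factors
W = rng.normal(size=(n_slots, rank))       # time-of-day factors

# CP reconstruction: x_hat[i, j, t] = sum_r U[i, r] * V[j, r] * W[t, r]
tensor_hat = np.einsum('ir,jr,tr->ijt', U, V, W)
```

Missing entries are then read off `tensor_hat` at the positions where no observation exists.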

Machine learning models

  • Missing data imputation

The urban traffic speed data set (Guangzhou data set, Gdata) records traffic speed from 214 road segments over two months (61 days, from August 1 to September 30, 2016) in Guangzhou, China. We organize the raw data into a time series matrix of size (214, 8784), i.e., 61 days × 144 time slots per day. Tensor-based models take a third-order tensor of size (214, 61, 144) as input; matrix-based models are tested on the time series matrix (214, 8784).
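The two data layouts are related by a plain reshape, since 61 × 144 = 8784. The sketch below uses random placeholder values instead of the real speed data and assumes the matrix columns are ordered chronologically, day by day.

```python
import numpy as np

n_segments, n_days, n_slots = 214, 61, 144   # shapes quoted in the text

# Placeholder values standing in for the real speed matrix (an assumption).
rng = np.random.default_rng(0)
dense_mat = rng.uniform(20.0, 60.0, size=(n_segments, n_days * n_slots))

# Matrix -> tensor: each row's 8784 columns become 61 days x 144 time slots.
dense_tensor = dense_mat.reshape(n_segments, n_days, n_slots)

# Tensor -> matrix: the inverse operation recovers the original layout.
mat_again = dense_tensor.reshape(n_segments, n_days * n_slots)
```

Under this ordering, entry `dense_tensor[i, d, t]` equals `dense_mat[i, d * 144 + t]`.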

We consider two common missing data scenarios: random missing (RM) and non-random missing (NM). For RM, we randomly remove a certain amount of observed entries from the matrix and use these entries as ground truth to evaluate RMSE. For NM, we run a correlated fiber missing experiment: we randomly choose a certain amount (e.g., 40%) of (location, day) combinations and remove the whole time series in each combination.
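A minimal sketch of how the two scenarios and the RMSE evaluation could be set up is given below. It makes several assumptions that are not part of the description above: the data are random placeholders, zeros encode missing entries, each entry or fiber is dropped independently (so the missing rate is approximate rather than exact), and a global-mean fill stands in for a real imputation model. The function names are hypothetical.

```python
import numpy as np

def random_missing_mask(shape, rate, rng):
    """RM: binary mask (1 = kept, 0 = removed), entries dropped independently."""
    return (rng.random(shape) >= rate).astype(int)

def fiber_missing_mask(shape, rate, rng):
    """NM: drop whole (location, day) fibers, i.e. every time slot of a pair."""
    n_loc, n_day, n_slot = shape
    keep = (rng.random((n_loc, n_day)) >= rate).astype(int)
    return np.repeat(keep[:, :, None], n_slot, axis=2)

def rmse(truth, estimate, eval_pos):
    """RMSE restricted to the held-out positions."""
    diff = truth[eval_pos] - estimate[eval_pos]
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)
dense = rng.uniform(20.0, 60.0, size=(214, 61, 144))   # placeholder tensor

# RM on the matrix layout, NM on the tensor layout.
rm_mask = random_missing_mask((214, 61 * 144), 0.2, rng)
nm_mask = fiber_missing_mask(dense.shape, 0.4, rng)

sparse = dense * nm_mask                        # zeros mark removed entries
eval_pos = (dense != 0) & (sparse == 0)         # removed but originally observed

# Stand-in "model": fill every entry with the mean of the remaining observations.
estimate = np.full(dense.shape, sparse[sparse != 0].mean())
baseline_rmse = rmse(dense, estimate, eval_pos)
```

Any imputation model can replace the mean fill; only `estimate` changes, while the masks and the evaluation positions stay the same.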

| Model | Paper | Data set | Missing | RMSE | Our implementation |
|-------|-------|----------|---------|------|--------------------|
| PMF | Salakhutdinov et al., 2007 | Gdata | 20%, RM | 4.0909 | Jupyter Notebook |
| GAIN | Yoon et al., 2018 | Gdata | 20%, RM | 4.6718 | Jupyter Notebook |
| PMF | Salakhutdinov et al., 2007 | Gdata | 40%, RM | 4.2280 | Jupyter Notebook |
| GAIN | Yoon et al., 2018 | Gdata | 40%, RM | 5.1776 | Jupyter Notebook |
| PMF | Salakhutdinov et al., 2007 | Gdata | 20%, NM | 4.3575 | Jupyter Notebook |
| GAIN | Yoon et al., 2018 | Gdata | 20%, NM | 6.5500 | Jupyter Notebook |
| PMF | Salakhutdinov et al., 2007 | Gdata | 40%, NM | 4.4866 | Jupyter Notebook |
| GAIN | Yoon et al., 2018 | Gdata | 40%, NM | 6.9947 | Jupyter Notebook |

  • PMF: Probabilistic matrix factorization.

    • Our implementation is adapted from two existing code bases (code1 and code2).
  • GAIN: Generative Adversarial Imputation Nets.

    • The code has been adapted for our implementation.
  • LocInt: local interpolation.

    • This model considers local information from observations at the neighboring time slots of the missing values.
  • TRMF: Temporal regularized matrix factorization. [Matlab code is also available!]

    • Reducing the burden of hyperparameter tuning is a worthwhile direction.
  • BGCP: Bayesian Gaussian CP decomposition. [Imputation example - Jupyter Notebook] [Matlab code is also available!]

  • BPMF: Bayesian probabilistic matrix factorization.

  • HaLRTC: High accuracy low rank tensor completion.
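To make the LocInt idea from the list above concrete (filling a missing value from observations at neighboring time slots), here is a minimal sketch that uses linear interpolation along the time axis. This is an assumption about one simple way to use local information, not necessarily how the repository's LocInt is implemented; the function name `local_interpolation` is hypothetical.

```python
import numpy as np

def local_interpolation(series, observed):
    """Fill missing entries of a 1-D series from neighboring observed slots.

    series   : 1-D float array (values at missing positions are ignored)
    observed : 1-D boolean array, True where the value was actually measured
    """
    t = np.arange(series.size)
    if not observed.any():
        return series.copy()              # nothing to interpolate from
    # Linear interpolation between the nearest observed neighbors; positions
    # before the first or after the last observation take the edge value.
    filled = np.interp(t, t[observed], series[observed])
    return np.where(observed, series, filled)

series = np.array([10.0, 0.0, 0.0, 40.0, 0.0])
observed = np.array([True, False, False, True, False])
imputed = local_interpolation(series, observed)
```

For this toy series the two interior gaps are filled with 20 and 30, and the trailing gap takes the last observed value, 40. Such a method cannot recover a whole missing day well, which is one reason fiber missing is the harder scenario.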

Selected references

Our publications

Please consider citing our papers if they help your research.

Our blog posts (in Chinese)

License

This work is released under the MIT license.
