Fully Context-Aware Video Prediction
This paper proposes a new neural network design for unsupervised learning through video prediction. Current video prediction models based on convolutional networks, recurrent networks, and their combinations often produce blurry predictions. Recent work has attempted to address this issue with techniques such as separate background and foreground modeling, motion flow learning, or adversarial training. We highlight that a contributing factor to this problem is the failure of current architectures to fully capture the past information relevant for accurately predicting the future. To address this shortcoming, we introduce a fully context-aware architecture, which captures the entire available past context for each pixel using Parallel Multi-Dimensional LSTM units and aggregates it using context blending blocks. Our model is simple, efficient, and directly applicable to high-resolution video frames. It yields state-of-the-art performance for next-step prediction on three challenging real-world video datasets (Human 3.6M, Caltech Pedestrian, and UCF-101) and produces sharp predictions of high visual quality.
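To make the high-level architecture concrete, below is a minimal, illustrative sketch of the two ideas named in the abstract: directional 2D recurrences that give each pixel context from one spatial quadrant of the frame, and a blending step that fuses the four directions so every pixel sees the full available context. This is not the authors' implementation; all class and parameter names (SimpleMDRecurrence, ContextBlend, hidden_dim, etc.) are hypothetical, and the paper's Parallel Multi-Dimensional LSTM processes anti-diagonals in parallel with full LSTM gating, which this simplified stand-in omits.

```python
# Simplified sketch (assumptions noted above); uses a gated 2D recurrence in
# place of the paper's PMD-LSTM, purely to illustrate quadrant-wise context
# aggregation followed by context blending.
import torch
import torch.nn as nn


class SimpleMDRecurrence(nn.Module):
    """Gated 2D recurrence: h[i, j] depends on h[i-1, j] and h[i, j-1]."""

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.inp = nn.Conv2d(in_dim, hidden_dim, kernel_size=1)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.inp(x)                       # per-pixel input projection: (B, D, H, W)
        B, D, H, W = u.shape
        zeros = torch.zeros(B, D, device=x.device)
        rows = []
        for i in range(H):
            row = []
            for j in range(W):
                top = rows[i - 1][j] if i > 0 else zeros
                left = row[j - 1] if j > 0 else zeros
                g = torch.sigmoid(self.gate(torch.cat([top, left], dim=1)))
                row.append(torch.tanh(u[:, :, i, j]) + g * 0.5 * (top + left))
            rows.append(row)
        # Reassemble the per-pixel states into a (B, D, H, W) context map.
        return torch.stack([torch.stack(r, dim=-1) for r in rows], dim=-2)


class ContextBlend(nn.Module):
    """Run four corner-to-corner sweeps and blend them so each pixel sees the whole frame."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.dirs = nn.ModuleList(SimpleMDRecurrence(in_dim, hidden_dim) for _ in range(4))
        self.blend = nn.Conv2d(4 * hidden_dim, out_dim, kernel_size=1)
        self.flips = [(), (2,), (3,), (2, 3)]  # no flip, vertical, horizontal, both

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = []
        for rnn, dims in zip(self.dirs, self.flips):
            xf = torch.flip(x, dims) if dims else x
            hf = rnn(xf)
            ctx.append(torch.flip(hf, dims) if dims else hf)
        return self.blend(torch.cat(ctx, dim=1))


if __name__ == "__main__":
    frame = torch.randn(1, 3, 16, 16)         # tiny frame for a quick smoke test
    model = ContextBlend(in_dim=3, hidden_dim=8, out_dim=3)
    print(model(frame).shape)                 # torch.Size([1, 3, 16, 16])
```

Each directional sweep here covers one quadrant of context per pixel; flipping the input and undoing the flip on the output yields the other three sweep directions, and the 1x1 convolution plays the role of a context blending block that merges them into a single fully context-aware representation.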