Discovering and Resolving Anomalies in Smart Cities
Introduction
Understanding the complex activity of people and vehicles in a large environment, such as a city neighborhood or even an entire city, is one of the main goals of smart cities. These activities are heterogeneous, distributed, time-varying, and mutually interacting in many ways, which makes it hard to capture them, understand them, and mitigate problems in a timely manner. That said, there has been tremendous progress in capturing aggregate statistics that help with traffic and city management as well as personal planning and scheduling. However, these mechanisms, while very useful, ignore anomalous activity and patterns. For example, an elderly person may trip and fall while crossing the street; in this case, it would be critical to delay the signal change to give them time to finish crossing safely. Other examples include roadwork areas, near misses between vehicles and pedestrians or bicyclists, and erratic or aggressive driving. Automatically identifying (cyber) and resolving (physical) the virtually endless variety of such anomalous patterns, sieved out of large amounts of data, is critical for smart city technologies to be successful. Discovering and resolving anomalies is challenging because they are complex and rare, depend on context, and depend on the spatial and temporal extent over which they are observed. This work conducts multi-disciplinary cyber-physical systems research to address the automatic discovery and resolution of anomalous patterns in smart city visual data.
During the second year of this award, great progress was made towards accomplishing our goals. A large amount of visual data containing roadwork was collected and manually labeled. This labeled roadwork dataset is essential for training models to identify areas of roadwork, a type of anomalous activity. Additionally, a method was developed to augment images with objects that are rarely observed but important to identify in specific environments; this permits training models when insufficient ‘real world’ data exists. Six cameras were installed in Shaler Township, in the Greater Pittsburgh Area. These cameras provide much needed visual data for developing and testing anomaly detection algorithms, as well as for computing analytics relevant to the township’s goal of improving mobility along its main corridor. A second public transit bus was instrumented with cameras and on-board computing to monitor roadside conditions in Washington, PA, for example, determining where sidewalks have or have not had snow removed or whether city trash cans are full and need to be emptied. Methods were developed to address the failure of object detection, segmentation, and tracking in the presence of severe occlusions in busy urban environments by using self-supervised learning on longitudinal data; this work produced a novel dataset that is available for research purposes. Our analysis architecture was improved to permit the ingress of data from virtually any camera around the world, and we developed methods for automatically localizing these cameras and estimating their intrinsic parameters, which permits the estimation of ‘real-world’ quantities such as vehicle speed. Finally, we developed preliminary methods to identify anomalous activity, including significant changes in vehicle/people counts over various periods of time and roadwork zones. A portion of this work supported two successfully defended Master’s Theses and two Robotics Institute Summer Scholars. One peer-reviewed paper was submitted and accepted to a top-tier computer vision conference, and two other peer-reviewed papers are currently under review.
This work is conducted in collaboration with the partners listed here.
Visual Data Sources
Static City Infrastructure Cameras
Six LTE cameras were installed in Shaler Township, which is in the Greater Pittsburgh Area. These cameras provide much needed visual data for developing and testing anomaly detection algorithms, as well as for computing analytics relevant to the township’s goal of improving mobility along its main corridor. Two cameras are installed at a busy signalized intersection, which is heavily trafficked due to its proximity to a shopping center, middle school, library, and park. Another camera is positioned to observe a nearby crosswalk. The other three cameras are located along Shaler’s main corridor to observe vehicle and pedestrian activity in relation to some of the township’s main gathering places. All of these cameras are integrated into our real-time processing pipeline for detecting and tracking vehicles and pedestrians.
Additional Bus with Interior and Exterior Cameras
Many vehicles, such as transit buses, are now routinely fitted with cameras. These live visual data are invaluable for real-time traffic and infrastructure monitoring and anomaly detection (such as landslides), but it is intractable to handle such a gigantic amount of data either locally or in the cloud due to computation and bandwidth limitations. We developed a system called BusEdge that uses edge computing to achieve efficient live video analytics on transit buses. The system uses an in-vehicle computer to preprocess the bus data locally and then transmits only the distilled data to the nearest cloudlet for further analysis. It provides an easily extensible and scalable platform for related applications to make use of the live bus data. The main components of the system are shown in Figure 1. Sensors (cameras, GPS, and IMU) and a computer were installed on a public transit bus. The computer ingests data from the sensors and applies various computer vision algorithms to the visual data. Processed data and images are wirelessly transmitted to a cloudlet for building more sophisticated models and for computing analytics, and the resulting analytics are visualized on a map. We now have a second bus equipped with cameras on the exterior as well as the interior, so that passengers can be tracked for the duration of their trip. The second bus also has higher-performance CPUs and a GPU to enable live video analytics, and it operates in Washington, PA.
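The central idea of BusEdge is "early discard": only frames that appear relevant leave the bus. The sketch below illustrates that idea under stated assumptions; the lightweight detector, the score threshold, the set of classes of interest, and the send_to_cloudlet stub are illustrative choices, not the deployed BusEdge code.

```python
# Minimal sketch of the bus-side "early discard" idea: run a lightweight
# detector on each frame and only forward frames that contain objects of
# interest (plus GPS/IMU metadata) to the cloudlet.
import torch
from torchvision.models.detection import fasterrcnn_mobilenet_v3_large_fpn

COCO_CLASSES_OF_INTEREST = {1, 3, 8}   # person, car, truck (COCO label ids)
SCORE_THRESHOLD = 0.6                  # assumed confidence cutoff

model = fasterrcnn_mobilenet_v3_large_fpn(weights="DEFAULT").eval()

def frame_is_interesting(frame_tensor):
    """Return True if any object of interest is detected above threshold."""
    with torch.no_grad():
        output = model([frame_tensor])[0]
    keep = output["scores"] > SCORE_THRESHOLD
    labels = set(output["labels"][keep].tolist())
    return bool(labels & COCO_CLASSES_OF_INTEREST)

def process_frame(frame_tensor, metadata, send_to_cloudlet):
    # Only the distilled frames leave the bus; everything else is discarded locally.
    if frame_is_interesting(frame_tensor):
        send_to_cloudlet(frame_tensor, metadata)
```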
Roadwork Dataset
We have focused on roadwork as a type of anomalous activity because it is highly disruptive to travel. However, identifying roadwork is extremely challenging because there are many different types of roadwork, ranging from fixing a street light to repaving the road surface, and each type is highly heterogeneous in appearance. For example, even the simplest case of changing the light source in a street light may or may not involve a bucket truck; if there is a bucket truck, it may fully or partially block the road, and it may or may not have cones around it. In order to develop algorithms for automatic detection and understanding of roadwork, a manually annotated dataset is needed to train and test deep learning models. To accomplish this goal, images were collected on an iPhone while driving around the Greater Pittsburgh area. In total, over 3,000 images were collected that contained roadwork or objects related to roadwork. Objects associated with roadwork were manually segmented using the CVAT tool. The labels, shown in Figure 2, were chosen for each of the objects following the Federal Highway Administration's Manual on Uniform Traffic Control Devices. In addition to object segmentations, images were tagged with various metadata for filtering the data for training/testing, along with a general description of the scene. An example of a segmented image is shown in Figure 3.
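CVAT can export per-image polygon annotations in its XML format, which makes the labeled roadwork dataset straightforward to consume in training code. The sketch below shows one way to parse such an export; the file name and export settings are assumptions, while the parsing logic follows CVAT's documented "CVAT for images" XML layout.

```python
# Sketch of reading polygon labels exported from CVAT ("CVAT for images" XML):
# <image> elements contain <polygon> elements with a semicolon-separated
# "points" attribute of "x,y" pairs.
import xml.etree.ElementTree as ET

def load_cvat_polygons(xml_path="roadwork_annotations.xml"):
    annotations = {}
    root = ET.parse(xml_path).getroot()
    for image in root.iter("image"):
        polygons = []
        for poly in image.iter("polygon"):
            points = [tuple(map(float, p.split(",")))
                      for p in poly.attrib["points"].split(";")]
            polygons.append({"label": poly.attrib["label"], "points": points})
        annotations[image.attrib["name"]] = polygons
    return annotations
```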
Watch and Learn Time-Lapse (WALT) Dataset
The Watch and Learn Time-lapse (WALT) dataset consists of images from 12 cameras (4K or 1080p) capturing urban environments over a year. The cameras view a diverse set of scenes, from traffic intersections to boardwalks. The dataset was captured as part of our work on robust object detection in the presence of occlusion (method described here). The dataset is novel and of potential benefit to the computer vision and machine learning communities, so we have made it available to download for personal use.
Synthetic Data Generation for Rare Objects
Anomalous events can sometimes be identified by anomalous or rarely observed objects, such as emergency vehicles. Synthetic data generation can augment existing datasets and consequently help train more robust computer vision models. However, synthetic image generation techniques proposed in prior work still face limitations in generating photorealistic data, maintaining low computation costs, and granting fine control over scene generation parameters. Synthetic data generation would be especially useful for training deep learning models for traffic analysis tasks. Therefore, we propose a photorealistic synthetic road scene generation method that inserts rendered 3D objects into a real 2D photo (Figure 4). We first estimate the ground plane equation, camera parameters, possible vehicle trajectories, and environment illumination map from the road scene photo. Then, these scene parameters are used to render the 3D objects in a physics-based renderer. Finally, we composite the rendered object smoothly into the road scene. Example composited images are shown in Figure 5. Simultaneously, the renderer can generate precise depth maps. Our “mixed reality” approach produces results that are higher resolution and more photorealistic than similar previous works while addressing their limitations. Thus, our approach can generate high-quality synthetic images and ground truth labels for a variety of computer vision tasks. In future work, we plan to evaluate whether our synthetic data and ground truth labels can improve deep neural network performance on challenging tasks like amodal segmentation.
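The compositing step can be illustrated with a common differential-rendering recipe, sketched below. It assumes the physics-based renderer has already produced an RGBA image of the inserted vehicle plus renders of the empty ground plane with and without the object's shadow; the file names and this particular shadow treatment are illustrative assumptions rather than the exact implementation.

```python
# Sketch of compositing a rendered object and its shadow into a real road photo.
import numpy as np
import imageio.v2 as imageio

photo        = imageio.imread("road_scene.jpg").astype(np.float32) / 255.0
obj_rgba     = imageio.imread("rendered_object.png").astype(np.float32) / 255.0
plain_plane  = imageio.imread("plane_no_shadow.png").astype(np.float32) / 255.0
shadow_plane = imageio.imread("plane_with_shadow.png").astype(np.float32) / 255.0

alpha = obj_rgba[..., 3:4]
# Darken the photo where the rendered shadow darkens the ground plane
# (differential rendering), then alpha-composite the object on top.
shadow_ratio = shadow_plane[..., :3] / np.clip(plain_plane[..., :3], 1e-3, None)
shadowed_photo = photo * np.clip(shadow_ratio, 0.0, 1.0)
composite = alpha * obj_rgba[..., :3] + (1.0 - alpha) * shadowed_photo

imageio.imwrite("composite.png", (composite * 255).astype(np.uint8))
```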
>> Download paper here >> Download poster here >> Download code here
Automatic Camera Localization and Calibration
Knowing the location of a camera is critical to understanding the context of detected activity. However, localizing a camera is difficult if physical access to it is not possible. We have developed a method for automatically localizing a camera. The methodology also enables the estimation of the camera's intrinsic and extrinsic parameters, which is critical for performing reconstruction. Given the camera's GPS location, Google Street View (GSV) is leveraged to build the scene's geometry at that location. GSV is a street-level imagery database and a rich source of millions of panorama images with wide coverage all over the world. Every panorama is geo-tagged with accurate GPS coordinates and captures a 360-degree horizontal and 180-degree vertical field-of-view at high resolution. We sample multiple panoramas within a 40-meter radius of the desired camera's location and use structure-from-motion (SfM) to reconstruct the scene (Figure 6). We also geo-register the up-to-scale SfM reconstruction using the provided GPS coordinates of the GSV panoramas, so our final 3D reconstruction of the scene is at metric scale.
To obtain the camera's intrinsic and extrinsic parameters, we follow the typical visual localization pipeline by localizing the desired background image (i.e., the query image) with respect to the 3D reconstruction built with GSV images (i.e., the database images). To establish robust 2D-3D correspondences, we use a learned feature matching method (SuperGlue) with SuperPoint feature descriptors to match the query image with the database images. Given the 2D-3D correspondences, we perform a bundle adjustment step to retrieve the camera's intrinsic parameters and its 6DoF extrinsic parameters. The large number of accurate matches between the query image and the rich GSV database images, produced by the learned feature matching modules, allows us to robustly recover both the intrinsic and extrinsic parameters of the camera. We have demonstrated the robustness of our method by reconstructing and localizing more than 70 cameras from publicly available, in-the-wild video streams all over the world. The automatic localization of 7 static cameras at an intersection in Pittsburgh is shown in Figure 7.
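The pose-recovery step can be illustrated with a simplified sketch, assuming the 2D-3D correspondences from SuperPoint + SuperGlue matching against the geo-registered GSV reconstruction are already available as arrays. The actual pipeline jointly refines the intrinsics in a bundle adjustment step; here, for illustration only, a coarse focal-length prior is fixed and OpenCV's RANSAC PnP solver recovers the 6DoF pose.

```python
# Simplified pose recovery from 2D-3D correspondences (not the full bundle adjustment).
import cv2
import numpy as np

def localize_camera(points_3d, points_2d, image_size, focal_guess=None):
    w, h = image_size
    f = focal_guess or 1.2 * max(w, h)          # coarse focal-length prior (assumption)
    K = np.array([[f, 0, w / 2],
                  [0, f, h / 2],
                  [0, 0, 1]], dtype=np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K, distCoeffs=None,
        reprojectionError=4.0,
        iterationsCount=2000,
        flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed: not enough consistent correspondences")
    R, _ = cv2.Rodrigues(rvec)
    camera_center = -R.T @ tvec                  # position in the metric, geo-registered frame
    return K, R, tvec, camera_center, inliers
```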
WALT: Watch And Learn 2D Amodal Representation from Time-Lapse Imagery
Current methods for object detection, segmentation, and tracking fail in the presence of severe occlusions in busy urban environments. Labeled real data of occlusions is scarce (even in large datasets) and synthetic data leaves a domain gap, making it hard to explicitly model and learn occlusions. In this work, we present the best of both the real and synthetic worlds for automatic occlusion supervision using a large readily available source of data: time-lapse imagery from stationary webcams observing street intersections over weeks, months, or even years. We introduce a new dataset, Watch and Learn Time-lapse (WALT), consisting of 12 (4K and 1080p) cameras capturing urban environments over a year. We exploit this real data in a novel way to automatically mine a large set of unoccluded objects. We develop a new method to classify unoccluded objects based on the idea that when objects on the same ground plane occlude one another, their bounding boxes overlap in a particular common configuration. Then these objects are composited in the same views to generate occlusions.
This longitudinal self-supervision is strong enough for an amodal network to learn object-occluder-occluded layer representations. We showed how to speed up the discovery of unoccluded objects and relate the confidence in this discovery to the rate and accuracy of training occluded objects. After watching and automatically learning for several days, this approach shows significant performance improvement in detecting and segmenting occluded people and vehicles, over human-supervised amodal approaches (Figure 8).
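The two key ideas can be illustrated with a simplified sketch: keep a detection as "unoccluded" when no other box in the same frame overlaps it, and paste previously mined unoccluded objects (with their segmentation masks) back into the same camera view so that occlusions are synthesized with amodal ground truth known by construction. The thresholds and helper functions below are illustrative assumptions; the actual WALT method reasons about the specific overlap configuration of objects on a shared ground plane.

```python
# Simplified mining of unoccluded objects and compositing of synthetic occlusions.
import numpy as np

def iou(a, b):
    """a, b: [x1, y1, x2, y2] bounding boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def mine_unoccluded(boxes, max_overlap=0.01):
    """Return indices of boxes that do not overlap any other box in the frame."""
    keep = []
    for i, box in enumerate(boxes):
        if all(iou(box, other) <= max_overlap
               for j, other in enumerate(boxes) if j != i):
            keep.append(i)
    return keep

def composite_occlusion(background, obj_crop, obj_mask, top_left):
    """Paste a mined object (crop + binary mask) back into the same camera view."""
    out = background.copy()
    y, x = top_left
    h, w = obj_mask.shape
    region = out[y:y + h, x:x + w]
    region[obj_mask > 0] = obj_crop[obj_mask > 0]
    return out
```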
>> Project webpage with code, publication, and presentation here
Leveraging Structure from Motion to Localize Snow Covered Sidewalks
The detection of hazardous conditions near public transit stations, such as snow coverage on sidewalks near bus stops, is necessary for ensuring the accessibility of public transit. Smart city infrastructures aim to facilitate this task, among many others, using computer vision. However, most state-of-the-art computer vision models require thousands of images to perform accurate detection, and few images of hazardous conditions exist because such conditions are generally rare. In this paper, we examine one such condition: snow-covered sidewalks. Previous work has focused on detecting vehicles in heavy snowfall or simply detecting the presence of snow. Our application, however, has the added complication of distinguishing snow that covers areas of importance where it can cause falls or other accidents (e.g., sidewalks) from snow covering areas that are not frequently trafficked (e.g., a field). This problem involves localizing both the accumulated snow and the areas of importance. We introduce a method that utilizes Structure from Motion (SfM) rather than annotated data to address this issue. Specifically, we reconstruct the positions of sidewalks on a given bus route by applying a segmentation model and SfM to images of clear sidewalks from bus cameras (Figure 9). Then, we detect if and where the sidewalks become obscured by snow. Although we demonstrate an application for snow coverage, this method can be adapted for other hazardous conditions as well.
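The final check can be illustrated with a sketch, assuming the clear-weather SfM step has already produced 3D sidewalk points and the current bus image comes with a registered camera pose (K, R, t) and a binary snow segmentation mask. The function name and the 0.5 coverage threshold are assumptions for illustration.

```python
# Project reconstructed sidewalk points into the current image and measure
# what fraction of them land on pixels classified as snow.
import numpy as np

def sidewalk_snow_coverage(sidewalk_points_3d, K, R, t, snow_mask):
    cam = R @ sidewalk_points_3d.T + t.reshape(3, 1)          # world -> camera
    in_front = cam[2] > 0
    pix = K @ cam[:, in_front]
    pix = (pix[:2] / pix[2]).T.round().astype(int)            # perspective divide
    h, w = snow_mask.shape
    valid = (pix[:, 0] >= 0) & (pix[:, 0] < w) & (pix[:, 1] >= 0) & (pix[:, 1] < h)
    pix = pix[valid]
    if len(pix) == 0:
        return 0.0
    return float(snow_mask[pix[:, 1], pix[:, 0]].mean())

# A sidewalk segment could then be flagged as obstructed when, for example,
# sidewalk_snow_coverage(...) > 0.5.
```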
>> Download paper here >> Download poster here >> Download code here
Detecting and Classifying Bus Stop Trash Cans
Trash cans are a central tool in managing the disposal of trash in urban areas but require human supervision to ensure regular emptying. It is difficult to manage many waste bins spread across a whole city, which presents an opportunity for computer vision technology to identify cans that require attention without human intervention. Previous work leveraged a camera-equipped bus to deploy a single deep learning-based computer vision model that detects trash cans along the path of the bus and classifies their fill level. We improve upon that work by presenting a multi-stage pipeline that combines their detection model with a separate, second model trained purely for classification (Figure 10). Our server-side detector was trained using the ResNet101+FPN backbone for Faster R-CNN available within the Detectron2 model zoo; it was trained for approximately 30,000 iterations, with the validation set evaluated every 1,000 iterations. The bus-side detector relies on the same architecture, but we compare different, simpler backbones capable of running on mobile devices, including ResNet18+FPN and MobileNetV2. The detector identifies and crops out trash cans from an image, which are then classified as either “Empty”, “Full”, or “having a garbage bag next to it”. Our classifier is also based on ResNet101; however, we replaced the final fully-connected layer with a smaller fully-connected layer to account for our three classes and applied the softmax function so that the output is a probability vector. This model was trained for 10 epochs. Our approach significantly increases the overall accuracy and precision for both tasks, as calculated by the commonly used COCO metrics. Additionally, we present a lightweight variant of our detection model that can run on the bus itself, where only limited computational resources are available. This enables us to deploy our system in a near real-time setting.
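A minimal sketch of the second-stage classifier described above follows: a ResNet-101 whose final fully-connected layer is replaced with a three-class head, with a softmax producing a probability vector over the classes. Training details (optimizer, augmentation, data loading) are omitted and would need to match the actual setup.

```python
# Three-class trash can fill-level classifier: ResNet-101 with a replaced head.
import torch
import torch.nn as nn
from torchvision import models

classifier = models.resnet101(weights="IMAGENET1K_V1")
classifier.fc = nn.Linear(classifier.fc.in_features, 3)   # Empty / Full / garbage bag next to it

def classify_crop(crop_tensor):
    """crop_tensor: normalized 1x3xHxW trash can crop produced by the detector stage."""
    classifier.eval()
    with torch.no_grad():
        logits = classifier(crop_tensor)
        probs = torch.softmax(logits, dim=1)               # probability vector over the 3 classes
    return probs
```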
>> Download paper here >> Download poster here >> Download code here
Anomaly Detection from Object Counts
Preliminary methods for anomaly detection were developed using the results of the computer vision analysis. Spatial-temporal scans were applied to detect anomalies in object behavior in longitudinal images from traffic cameras. For each camera, the counts of detected objects are organized into a 2x2 contingency table, with each count assigned to one of four categories based on two factors, each with two possibilities; the table is used to determine whether there is an association between the two factors. For example, the target time series could be the daily total number of vehicles detected in a scene over time, while the baseline time series could be the daily total number of people detected in the same scene. A reference window provides information about previously observed traffic, e.g., the total number of vehicles and the total number of people in the past week. The query window is then tested for potential anomalous patterns relative to the reference window, i.e., a significant increase or decrease in the number of vehicles versus the number of people.
Given the choices of query and reference time windows, a Chi-square test (or Fisher’s exact test for very low counts) of the 2x2 contingency table is performed, with the p-value indicating the significance of the anomalous pattern. The same test is performed for moving windows across time, which yields a time series of p-values as the anomaly signal. The current approach detects anomalies efficiently across time. Figure 11 shows the automatic detection of an anomalous event: a statistically significant increase in the number of people compared to the number of vehicles on the road next to an Italian restaurant around June 27, 2021, due to a road closure for a social event. For future work, methods are being developed to detect anomalies within different spatial regions of a scene, e.g., sidewalk versus road. This will be done by modifying the target and baseline time series definitions and using semantic segmentation to identify regions of interest. We are also integrating object trajectories and speed estimation as additional measures for anomaly detection.
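A minimal sketch of the sliding-window contingency-table test described above is shown below. The counts are hypothetical; scipy's chi2_contingency (or fisher_exact for very low counts) returns the p-value used as the anomaly signal.

```python
# One window of the 2x2 contingency-table anomaly test.
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def window_p_value(vehicles_ref, people_ref, vehicles_query, people_query,
                   low_count=5):
    table = np.array([[vehicles_ref, people_ref],
                      [vehicles_query, people_query]])
    if table.min() < low_count:
        _, p = fisher_exact(table)           # very low counts
    else:
        _, p, _, _ = chi2_contingency(table)
    return p

# Example: a week-long reference window vs. a one-day query window.
# A small p-value flags a significant shift in the vehicle/people ratio.
p = window_p_value(vehicles_ref=7000, people_ref=1400,
                   vehicles_query=950, people_query=900)
print(f"anomaly signal p-value: {p:.2e}")
```

Sliding the query window across time and repeating this test produces the time series of p-values used as the anomaly signal.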
Detection of Roadwork Zones
This work captures the spatial and temporal extent of roadwork zones. Our goal is to detect and analyze instances of work zones from a set of raw bus data; we trained an object detection model on the traffic cone class and evaluated it on traffic cones in the bus data. To analyze the detections, we introduce a method for mapping data points from bus trajectories into a discretized 2D space defined by a spatial axis and a temporal axis. The temporal axis is defined by the start times of each bus run, and the spatial axis is a sequence of baseline coordinates; each bus data point is assigned to the nearest point on the baseline trajectory. Next, we use a bottom-up approach to evaluate an anomaly score at each point of this space-time grid. Finally, we visualize the anomaly scores as a heat map over the discretized 2D space. From this 2D map, we discuss the spatial and temporal attributes discovered by mining the bus data for work zone events.
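The discretization can be illustrated with the sketch below, which bins per-image traffic cone counts into the space-time grid: one axis indexes the nearest baseline-trajectory point, the other indexes the bus run. The input format and the use of raw cone counts as the per-cell score are assumptions for illustration.

```python
# Bin traffic cone detections into a (bus run) x (baseline location) grid.
import numpy as np

def build_space_time_grid(detections, baseline_xy, run_ids):
    """
    detections:  list of (run_id, x, y, cone_count) for each processed image
    baseline_xy: (N, 2) array of baseline trajectory coordinates (spatial axis)
    run_ids:     ordered list of bus-run identifiers (temporal axis)
    """
    run_index = {r: i for i, r in enumerate(run_ids)}
    grid = np.zeros((len(run_ids), len(baseline_xy)))
    for run_id, x, y, cone_count in detections:
        # Assign the observation to the nearest point on the baseline trajectory.
        spatial_bin = int(np.argmin(np.linalg.norm(baseline_xy - np.array([x, y]), axis=1)))
        grid[run_index[run_id], spatial_bin] += cone_count
    return grid   # visualized as a heat map, as in Figures 12 and 13
```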
Figure 12 illustrates a large-scale work zone where half of the road is blocked. From the heat map, we see an increase in traffic cone counts near the start of the segment and a decrease at the tail of the region. Looking at the matching images, we see that at the beginning of the work zone there is a denser placement of traffic cones to notify incoming drivers, while in the middle sections the cones are placed more sparsely. The gradient pattern in the heat map gives a clear indication of the spatial boundaries of the work zone. Using this pattern, we can efficiently identify the potential boundaries of work zones at a coarse level (from downsampled trajectories). These data can help learn to predict the spatial boundaries of work zones when roadside anomalies are encountered.
From the heat map in Figure 13, we see a similar gradient effect along the temporal axis at a specific location Li. Looking at the corresponding images, we see that location Li is undergoing long-term construction that spans multiple bus runs. By mapping the bus data onto a spatially and temporally structured space, we were able to pinpoint a specific region of change within the vast amount of large-scale bus data and capture the progression of this work zone from start to finish. These data could be invaluable to construction management companies for automated monitoring of work zone progress.
Crosswalk Change Detection
This work focused on training 2D object detectors and detecting crosswalk changes across time with vehicle-mounted cameras. To use crosswalk detections for change prediction, the detections need to be spatially aligned across different times. Because of the challenges of performing change detection in the perspective view, we explored methods to map crosswalk detections out of the image plane and onto a global 3D ground plane, performing crosswalk change detection from the bird's-eye view. Working in the ground plane not only avoids the need for aligned images, it also allows detections from sequential dash cam images to be composited into a single coordinate frame that better represents the scene; for example, one image may capture the front crosswalk of an intersection well while the image taken ten meters further forward captures the back crosswalk well. For this work, two datasets were introduced: one that uses safety cameras from a single bus that regularly passes the same locations, and another that consists of dash cam images from multiple vehicles. Our proposed method is robust in high-traffic areas and can localize changes to a specific crosswalk at an intersection. Qualitative results for six of the 17 changes are shown in Figure 14, where many crosswalks have been removed due to road repaving. Each crosswalk is represented by a polygon, and the color indicates the type of change. If any crosswalk experiences a change, the change prediction for the scene is positive. The change dataset collected from multiple sources contains eight change and eight no-change examples; the accuracy is 0.88 for change detection and 0.63 for no-change detection. Because the images are taken from multiple sources, their quality, size, distortion, and viewpoints differ, leading to different distributions of how crosswalks appear in the image. As a result, detection accuracy suffers, which causes poor change detection results.
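The ground-plane mapping can be illustrated with a sketch that assumes a per-image homography H (for example, derived from the camera localization step) mapping image pixels to metric ground-plane coordinates; polygon overlap is then one simple way to associate crosswalks across time. The function names, the homography source, and the IoU threshold are assumptions for illustration.

```python
# Map crosswalk detections to a common ground plane and associate them across passes.
import cv2
import numpy as np
from shapely.geometry import Polygon

def crosswalk_to_ground_plane(polygon_px, H):
    """polygon_px: (N, 2) crosswalk corners in pixels; H: 3x3 image->ground homography."""
    pts = polygon_px.reshape(-1, 1, 2).astype(np.float64)
    ground = cv2.perspectiveTransform(pts, H).reshape(-1, 2)
    return Polygon(ground)

def same_crosswalk(poly_a, poly_b, iou_threshold=0.3):
    """Associate two ground-plane polygons from different passes if they overlap enough."""
    inter = poly_a.intersection(poly_b).area
    union = poly_a.union(poly_b).area
    return union > 0 and inter / union >= iou_threshold

# A crosswalk present in an earlier pass with no associated polygon in a later
# pass would then be flagged as a candidate "removed" change.
```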