A Survey of Articulation Estimation
According to the input information, articulation estimation can be divided into two types.
- Single-Observation Based
  - CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects
  - CA2T-Net: Category-Agnostic 3D Articulation Transfer from Single Image
  - Shape2Motion: Joint Analysis of Motion Parts and Attributes from 3D Shapes
- Multi-Observation Based
According to how features are used, articulation estimation can be divided into three types.
- Exact Model Based
  - CA2T-Net: Category-Agnostic 3D Articulation Transfer from Single Image
- Explicit Model Based
  - CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects
- Implicit Model Based
According to the estimation method, articulation estimation can be divided into four types.
- Non-Deep-Learning Methods
- Supervised Deep Learning Methods
  - CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects
- Semi-Weakly Supervised Deep Learning Methods
  - Semi-Weakly Supervised Object Kinematic Motion Prediction
- Unsupervised Deep Learning Methods
According to the number of stages, articulation estimation can be divided into two types.
- Multi-Stage
  - Shape2Motion: Joint Analysis of Motion Parts and Attributes from 3D Shapes
- Single-Stage (End-to-End)
  - CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects
  - CA2T-Net: Category-Agnostic 3D Articulation Transfer from Single Image
  - Shape2Motion: Joint Analysis of Motion Parts and Attributes from 3D Shapes
Articulation estimation is sometimes combined with object detection, pose estimation, tracking, or reconstruction.
- Object Detection
- Reconstruction
  - CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects
  - SAOR: Single-View Articulated Object Reconstruction
Non-Deep-Learning Methods
Before deep learning became popular, several methods already addressed the articulation estimation task.
CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects
Deep Learning Methods
Semi-Weakly Supervised Object Kinematic Motion Prediction
Towards Understanding Articulated Objects
Robots operating in home environments must be able to interact with articulated objects such as doors or drawers. Ideally, robots are able to autonomously infer articulation models by observation. In this paper, we present an approach to learn kinematic models by inferring the connectivity of rigid parts and the articulation models for the corresponding links. Our method uses a mixture of parameterized and parameter-free representations. To obtain parameter-free models, we seek for low-dimensional manifolds of latent action variables in order to provide the best explanation of the given observations. The mapping from the constrained manifold of an articulated link to the work space is learned by means of Gaussian process regression. Our approach has been implemented and evaluated using real data obtained in various home environment settings. Finally, we discuss the limitations and possible extensions of the proposed method.
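The abstract above mentions learning the mapping from an articulated link's constrained manifold to the work space with Gaussian process regression. A minimal NumPy sketch of that idea follows. It assumes a one-dimensional latent configuration `q` that is already known (the paper infers it from observations), an RBF kernel, and hand-picked hyperparameters. The names `rbf_kernel` and `gp_predict` are illustrative and do not come from the paper.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.3, variance=1.0):
    """Squared-exponential kernel between two 1D arrays of latent values."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_predict(q_train, y_train, q_test, noise=1e-6):
    """GP posterior mean of work-space positions at q_test.

    q_train: (N,) latent configurations (assumed known in this sketch).
    y_train: (N, 3) observed part positions.
    q_test:  (M,) query configurations.
    """
    offset = y_train.mean(axis=0)  # constant prior mean
    K = rbf_kernel(q_train, q_train) + noise * np.eye(len(q_train))
    alpha = np.linalg.solve(K, y_train - offset)
    return rbf_kernel(q_test, q_train) @ alpha + offset


# Synthetic observations: a door handle sweeping a quarter circle of radius 0.8.
q_obs = np.linspace(0.0, np.pi / 2, 15)
handle = 0.8 * np.stack(
    [np.cos(q_obs), np.sin(q_obs), np.zeros_like(q_obs)], axis=1
)
prediction = gp_predict(q_obs, handle, np.array([0.4]))
```

Because no parametric joint type is imposed, the same regression could in principle describe links that are neither purely prismatic nor purely revolute, which is the appeal of the parameter-free representation.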
Interactive Segmentation, Tracking, and Kinematic Modeling of Unknown Articulated Objects
We present an interactive perceptual skill for segmenting, tracking, and kinematic modeling of 3D articulated objects. This skill is a prerequisite for general manipulation in unstructured environments. Robot-environment interaction is used to move an unknown object, creating a perceptual signal that reveals the kinematic properties of the object. The resulting perceptual information can then inform and facilitate further manipulation. The algorithm is computationally efficient, handles occlusion, and depends on little object motion; it only requires sufficient texture for visual feature tracking. We conducted experiments with everyday objects on a mobile manipulation platform equipped with an RGB-D sensor. The results demonstrate the robustness of the proposed method to lighting conditions, object appearance, size, structure, and configuration.
Reconstructing Articulated Rigged Models from RGB-D Videos
Although commercial and open-source software exist to reconstruct a static object from a sequence recorded with an RGB-D sensor, there is a lack of tools that build rigged models of articulated objects that deform realistically and can be used for tracking or animation. In this work, we fill this gap and propose a method that creates a fully rigged model of an articulated object from depth data of a single sensor. To this end, we combine deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow. The fully rigged model then consists of a watertight mesh, embedded skeleton, and skinning weights.
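The rigged model described above has three components: a mesh, an embedded skeleton, and skinning weights. One common way to deform such a model is linear blend skinning. The abstract does not say which skinning scheme the authors use, so the sketch below is only an illustration under that assumption, with a hypothetical function name.

```python
import numpy as np


def linear_blend_skinning(vertices, weights, bone_transforms):
    """Deform rest-pose vertices by blending per-bone rigid transforms.

    vertices:        (N, 3) rest-pose mesh vertices.
    weights:         (N, B) skinning weights, each row summing to 1.
    bone_transforms: (B, 4, 4) homogeneous transform of each bone.
    """
    ones = np.ones((len(vertices), 1))
    homogeneous = np.hstack([vertices, ones])  # (N, 4)
    # For vertex n: sum over bones b of weights[n, b] * (bone_transforms[b] @ homogeneous[n])
    blended = np.einsum("nb,bij,nj->ni", weights, bone_transforms, homogeneous)
    return blended[:, :3]


# Two bones: one stays fixed, the other is lifted by 1 unit along z.
fixed = np.eye(4)
lifted = np.eye(4)
lifted[2, 3] = 1.0
rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
w = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
posed = linear_blend_skinning(rest, w, np.stack([fixed, lifted]))
```

The middle vertex is influenced equally by both bones, so it ends up halfway between the fixed and the lifted pose.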
The RBO Dataset of Articulated Objects and Interactions
We present a dataset with models of 14 articulated objects commonly found in human environments and with RGBD video sequences and wrenches recorded of human interactions with them. The 358 interaction sequences total 67 minutes of human manipulation under varying experimental conditions (type of interaction, lighting, perspective, and background). Each interaction with an object is annotated with the ground truth poses of its rigid parts and the kinematic state obtained by a motion capture system. For a subset of 78 sequences (25 minutes), we also measured the interaction wrenches. The object models contain textured three-dimensional triangle meshes of each link and their motion constraints. We provide Python scripts to download and visualize the data.
Deep Part Induction from Articulated Object Pairs
Object functionality is often expressed through part articulation – as when the two rigid parts of a scissor pivot against each other to perform the cutting function. Such articulations are often similar across objects within the same functional category. In this paper we explore how the observation of different articulation states provides evidence for part structure and motion of 3D objects. Our method takes as input a pair of unsegmented shapes representing two different articulation states of two functionally related objects, and induces their common parts along with their underlying rigid motion. This is a challenging setting, as we assume no prior shape structure, no prior shape category information, no consistent shape orientation, the articulation states may belong to objects of different geometry, plus we allow inputs to be noisy and partial scans, or point clouds lifted from RGB images. Our method learns a neural network architecture with three modules that respectively propose correspondences, estimate 3D deformation flows, and perform segmentation. To achieve optimal performance, our architecture alternates between correspondence, deformation flow, and segmentation prediction iteratively in an ICP-like fashion. Our results demonstrate that our method significantly outperforms state-of-the-art techniques in the task of discovering articulated parts of objects. In addition, our part induction is object-class agnostic and successfully generalizes to new and unseen objects.
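Once correspondences and a part segmentation are available, the rigid motion of a part between two articulation states can be recovered in closed form. The sketch below uses the standard Kabsch (SVD-based) least-squares alignment as a stand-in for that step. It does not implement the paper's network, which predicts correspondences and deformation flows, and `estimate_rigid_motion` is a hypothetical helper name.

```python
import numpy as np


def estimate_rigid_motion(src, dst):
    """Least-squares rotation R and translation t with dst ~ src @ R.T + t.

    src, dst: (N, 3) corresponding points of one rigid part in two states.
    """
    src_center = src.mean(axis=0)
    dst_center = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_center).T @ (dst - dst_center)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_center - R @ src_center
    return R, t


# Synthetic part rotated by 30 degrees about z and then shifted.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
angle = np.deg2rad(30.0)
R_true = np.array(
    [
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ]
)
t_true = np.array([0.1, -0.2, 0.3])
dst = src @ R_true.T + t_true
R_est, t_est = estimate_rigid_motion(src, dst)
```

In an ICP-like loop, a closed-form step of this kind would alternate with re-estimating correspondences; here the correspondences are given by construction.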
Deep Learning Based Robotic Tool Detection and Articulation Estimation With Spatio-Temporal Layers
Surgical-tool joint detection from laparoscopic images is an important but challenging task in computer-assisted minimally invasive surgery. Illumination levels, variations in background and the different number of tools in the field of view, all pose difficulties to algorithm and model training. Yet, such challenges could be potentially tackled by exploiting the temporal information in laparoscopic videos to avoid per frame handling of the problem. In this letter, we propose a novel encoder–decoder architecture for surgical instrument joint detection and localization that uses three-dimensional convolutional layers to exploit spatio-temporal features from laparoscopic videos. When tested on benchmark and custom-built datasets, a median Dice similarity coefficient of 85.1% with an interquartile range of 4.6% highlights performance better than the state of the art based on single-frame processing. Alongside novelty of the network architecture, the idea for inclusion of temporal information appears to be particularly useful when processing images with unseen backgrounds during the training phase, which indicates that spatio-temporal features for joint detection help to generalize the solution.
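The Dice similarity coefficient reported above is the overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted and a reference mask. A minimal sketch for binary masks follows. Returning 1.0 when both masks are empty is an assumed convention, and the function name is illustrative.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Both masks are empty; treated as perfect agreement (assumption).
        return 1.0
    overlap = np.logical_and(pred, target).sum()
    return float(2.0 * overlap / total)


pred_mask = np.array([[1, 1, 0], [0, 0, 0]])
true_mask = np.array([[1, 0, 0], [0, 1, 0]])
score = dice_coefficient(pred_mask, true_mask)  # one shared pixel, two pixels in each mask
```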
Learning to Generalize Kinematic Models to Novel Objects
Robots operating in human environments must be capable of interacting with a wide variety of articulated objects such as cabinets, refrigerators, and drawers. Existing approaches require human demonstration or minutes of interaction to fit kinematic models to each novel object from scratch. We present a framework for estimating the kinematic model and configuration of previously unseen articulated objects, conditioned upon object type, from as little as a single observation. We train our system in simulation with a novel dataset of synthetic articulated objects; at runtime, our model can predict the shape and kinematic model of an object from depth sensor data. We demonstrate that our approach enables a MOVO robot to view an object with its RGB-D sensor, estimate its motion model, and use that estimate to interact with the object.
RPM-Net: Recurrent Prediction of Motion and Parts from Point Cloud
We introduce RPM-Net, a deep learning-based approach which simultaneously infers movable parts and hallucinates their motions from a single, un-segmented, and possibly partial, 3D point cloud shape. RPM-Net is a novel Recurrent Neural Network (RNN), composed of an encoder-decoder pair with interleaved Long Short-Term Memory (LSTM) components, which together predict a temporal sequence of pointwise displacements for the input point cloud. At the same time, the displacements allow the network to learn movable parts, resulting in a motion-based shape segmentation. Recursive applications of RPM-Net on the obtained parts can predict finer-level part motions, resulting in a hierarchical object segmentation. Furthermore, we develop a separate network to estimate part mobilities, e.g., per-part motion parameters, from the segmented motion sequence. Both networks learn deep predictive models from a training set that exemplifies a variety of mobilities for diverse objects. We show results of simultaneous motion and part predictions from synthetic and real scans of 3D objects exhibiting a variety of part mobilities, possibly involving multiple movable parts.
A Hand Motion-guided Articulation and Segmentation Estimation
We present a method for simultaneous articulation model estimation and segmentation of an articulated object in RGB-D images using human hand motion. Our method uses the hand motion in the processes of the initial articulation model estimation, ICP-based model parameter optimization, and region selection of the target object. The hand motion gives an initial guess of the articulation model: prismatic or revolute joint. The method estimates the joint parameters by aligning the RGB-D images with the constraint of the hand motion. Finally, the target regions are selected from the cluster regions which move symmetrically along with the articulation model. Our experimental results show the robustness of the proposed method for the various objects.
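The initial guess between a prismatic and a revolute joint can be illustrated with a simple straightness test on the hand trajectory. The abstract does not describe how the authors make this decision, so the sketch below is an assumption: it fits a line with an SVD and calls the joint prismatic when the off-line spread is small relative to the spread along the line. The function name and the threshold `rel_tol` are hypothetical.

```python
import numpy as np


def guess_joint_type(trajectory, rel_tol=0.02):
    """Guess 'prismatic' or 'revolute' from an (N, 3) hand trajectory.

    A straight path suggests a prismatic joint and a curved path a revolute
    joint. rel_tol is a hypothetical threshold on the ratio of off-line
    spread to along-line spread.
    """
    points = np.asarray(trajectory, dtype=float)
    centered = points - points.mean(axis=0)
    # Singular values measure the spread along the principal directions.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    along = singular_values[0]
    if along == 0.0:
        raise ValueError("trajectory contains no motion")
    off_line = np.sqrt(np.sum(singular_values[1:] ** 2))
    return "prismatic" if off_line / along < rel_tol else "revolute"


# A drawer pulled 0.3 m along a fixed direction.
steps = np.linspace(0.0, 0.3, 50)
drawer_path = steps[:, None] * np.array([1.0, 0.0, 0.0])

# A door handle swept through 60 degrees at radius 0.5 m.
angles = np.linspace(0.0, np.pi / 3, 50)
door_path = 0.5 * np.stack(
    [np.cos(angles), np.sin(angles), np.zeros_like(angles)], axis=1
)

drawer_guess = guess_joint_type(drawer_path)
door_guess = guess_joint_type(door_path)
```

A guess like this only seeds the model; per the abstract, the joint parameters themselves are then estimated by aligning the RGB-D images under the hand-motion constraint.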
Category-Level Articulated Object Pose Estimation
This paper addresses the task of category-level pose estimation for articulated objects from a single depth image. We present a novel category-level approach that correctly accommodates object instances previously unseen during training. We introduce Articulation-aware Normalized Coordinate Space Hierarchy (ANCSH) – a canonical representation for different articulated objects in a given category. As the key to achieve intra-category generalization, the representation constructs a canonical object space as well as a set of canonical part spaces. The canonical object space normalizes the object orientation, scales and articulations (e.g. joint parameters and states) while each canonical part space further normalizes its part pose and scale. We develop a deep network based on PointNet++ that predicts ANCSH from a single depth point cloud, including part segmentation, normalized coordinates, and joint parameters in the canonical object space. By leveraging the canonicalized joints, we demonstrate: (1) improved performance in part pose and scale estimations using the induced kinematic constraints from joints; (2) high accuracy for joint parameter estimation in camera space.
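The scale and translation part of mapping an object into a normalized coordinate space can be sketched as follows. The convention used here is an assumption in the style of NOCS: the tight bounding box is centered at (0.5, 0.5, 0.5) and its diagonal is scaled to length 1. The abstract does not state the paper's exact convention, and orientation and articulation normalization are not covered. The function name is illustrative.

```python
import numpy as np


def normalize_to_unit_cube(points):
    """Map an (N, 3) point cloud into the unit cube.

    Returns the normalized points together with the bounding-box center and
    diagonal length needed to undo the mapping.
    """
    points = np.asarray(points, dtype=float)
    lower = points.min(axis=0)
    upper = points.max(axis=0)
    center = (lower + upper) / 2.0
    diagonal = np.linalg.norm(upper - lower)
    if diagonal == 0.0:
        raise ValueError("point cloud has zero extent")
    normalized = (points - center) / diagonal + 0.5
    return normalized, center, diagonal


cloud = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 2.0], [1.0, 0.5, 0.5]])
normalized, center, diagonal = normalize_to_unit_cube(cloud)

# Undo the normalization to return to the original (camera) space.
restored = (normalized - 0.5) * diagonal + center
```

Keeping `center` and `diagonal` makes the mapping invertible, which is what allows quantities predicted in the canonical space to be transferred back to camera space.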