ActionRecognitionNet Model Card

Model Overview
The model described in this card is an action recognition network, which aims to recognize what people do in videos. Six pretrained ActionRecognitionNet models are delivered: three 2D models trained with RGB frames, optical flow generated on an A100 with the NVOF SDK, and optical flow generated on a Jetson Xavier with VPI, respectively, and three 3D models with the same input types as the 2D models. All models are trained on a subset of HMDB51.

Model Architecture
Both the 2D and 3D models use a ResNet-style backbone. They take a sequence of RGB frames or optical flow grayscale images as input and predict the action label for those frames. The training algorithm optimizes the network to minimize the cross-entropy loss for classification.

The models are trained on a subset of HMDB51: we pick videos of walk, ride_bike, run, fall_floor and push out of HMDB51 to form HMDB5. The training videos vary in visible body parts, camera motion, camera viewpoint, number of people involved in the action, and video quality.

Visible body parts: upper body, full body, lower body
Camera viewpoint: front, back, left, right
Number of people involved in the action: single, two, three
Video size: most videos are 320x240

TAO Toolkit supports training ActionRecognitionNet with RGB input or optical flow input. The data must be organized as follows: the dataset is divided into directories by class, and each class directory contains multiple video clip folders, each holding the corresponding RGB frames (rgb), optical flow x-axis grayscale images (u), and optical flow y-axis grayscale images (v).

The evaluation dataset is obtained by randomly collecting 10% of the videos per class out of HMDB5. These videos are likewise diverse in visible body parts, camera motion, camera viewpoint, number of people involved in the action, and video quality. The key performance indicator is the accuracy of action recognition. Center evaluation performs inference on the middle frames of the video clip; for example, if the model requires 32 frames as input and a video clip has 128 frames, frames from index 48 to index 79 are used. Conv evaluation performs inference on 10 segments of a video clip: we uniformly divide the clip into 10 parts, choose the center of each part as a start point, and pick 32 consecutive frames from each start point to form the inference segments.
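Under those rules, a dataset tree might look like the following. The root and clip folder names here are hypothetical; only the class / clip / rgb-u-v nesting is prescribed:

```
hmdb5/
├── walk/
│   ├── clip_0001/
│   │   ├── rgb/    # RGB frames
│   │   ├── u/      # optical flow x-axis grayscale images
│   │   └── v/      # optical flow y-axis grayscale images
│   └── clip_0002/
│       ├── rgb/
│       ├── u/
│       └── v/
├── run/
│   └── ...
└── push/
    └── ...
```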
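The center and conv frame-sampling schemes described above can be sketched in a few lines of Python. The function names and the clamping of windows to the clip boundary are illustrative assumptions, not the TAO Toolkit API:

```python
def center_eval_start(num_frames, seq_len=32):
    # Center evaluation: one window of seq_len frames centered in the clip.
    # For num_frames=128 and seq_len=32 this yields start index 48
    # (frames 48-79), matching the example above.
    return max((num_frames - seq_len) // 2, 0)


def conv_eval_starts(num_frames, seq_len=32, num_segments=10):
    # Conv evaluation: divide the clip into num_segments equal parts, use
    # the center of each part as the start of a seq_len-frame window, and
    # clamp each start so the window stays inside the clip (an assumption
    # for clips shorter than the last window would allow).
    part = num_frames / num_segments
    last_valid = max(num_frames - seq_len, 0)
    return [min(int(i * part + part / 2), last_valid)
            for i in range(num_segments)]
```

For a 128-frame clip, `center_eval_start(128)` returns 48, and `conv_eval_starts(128)` returns 10 start indices, each leaving room for a full 32-frame window.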
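As a reminder of the training objective mentioned under Model Architecture, the cross-entropy loss for one clip is the negative log-probability that the softmax over the per-class scores assigns to the true action label. A minimal pure-Python sketch (illustrative, not TAO code):

```python
import math


def cross_entropy(logits, label):
    # Numerically stable softmax over the per-class action scores,
    # followed by the negative log-likelihood of the true label.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return -math.log(exps[label] / total)
```

With the 5 HMDB5 classes and uniform logits, the loss is ln 5 ≈ 1.609; training drives it toward 0 as the network concentrates probability on the correct action.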