API Reference¶

mmaction.apis¶

mmaction.apis.inference_recognizer(model, video_path, label_path, use_frames=False, outputs=None, as_tensor=True)[source]¶

Run inference on a video with the recognizer.

Parameters

model (nn.Module) – The loaded recognizer.
video_path (str) – The video file path/url or the rawframes directory path. If use_frames is set to True, it should be rawframes directory path. Otherwise, it should be video file path.
label_path (str) – The label file path.
use_frames (bool) – Whether to use rawframes as input. Default: False.
outputs (list(str) | tuple(str) | str | None) – Names of layers whose outputs need to be returned, default: None.
as_tensor (bool) – Same as that in OutputHook. Default: True.

Returns

Top-5 recognition result dict, and the output feature maps from the layers specified in outputs.

Return type

dict[tuple(str, float)], dict[torch.tensor | np.ndarray]

mmaction.apis.init_recognizer(config, checkpoint=None, device='cuda:0', use_frames=False)[source]¶

Initialize a recognizer from config file.

Parameters

config (str | mmcv.Config) – Config file path or the config object.
checkpoint (str | None, optional) – Checkpoint path/url. If set to None, the model will not load any weights. Default: None.
device (str | torch.device) – The desired device of returned tensor. Default: ‘cuda:0’.
use_frames (bool) – Whether to use rawframes as input. Default: False.

Returns

The constructed recognizer.

Return type

nn.Module

mmaction.apis.multi_gpu_test(model, data_loader, tmpdir=None, gpu_collect=True)[source]¶

Test model with multiple gpus.

This method tests model with multiple gpus and collects the results under two different modes: gpu and cpu modes. By setting ‘gpu_collect=True’, it encodes results to gpu tensors and uses gpu communication for results collection. In cpu mode, it saves the results on different gpus to ‘tmpdir’ and collects them by the rank 0 worker.

Parameters

model (nn.Module) – Model to be tested.
data_loader (DataLoader) – PyTorch data loader.
tmpdir (str) – Path of directory to save the temporary results from different gpus under cpu mode. Default: None
gpu_collect (bool) – Option to use either gpu or cpu to collect results. Default: True

Returns

The prediction results.

Return type

list

mmaction.apis.single_gpu_test(model, data_loader)[source]¶

Test model with a single gpu.

This method tests model with a single gpu and displays test progress bar.

Parameters

model (nn.Module) – Model to be tested.
data_loader (DataLoader) – PyTorch data loader.

Returns

The prediction results.

Return type

list

mmaction.apis.train_model(model, dataset, cfg, distributed=False, validate=False, test={'test_best': False, 'test_last': False}, timestamp=None, meta=None)[source]¶

Train model entry function.

Parameters

model (nn.Module) – The model to be trained.
dataset (Dataset) – Train dataset.
cfg (dict) – The config dict for training.
distributed (bool) – Whether to use distributed training. Default: False.
validate (bool) – Whether to do evaluation. Default: False.
test (dict) – The testing option, with two keys: test_last & test_best. The value is True or False, indicating whether to test the corresponding checkpoint. Default: dict(test_best=False, test_last=False).
timestamp (str | None) – Local time for runner. Default: None.
meta (dict | None) – Meta dict to record some important information. Default: None

mmaction.core¶

optimizer¶

class mmaction.core.optimizer.CopyOfSGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)[source]¶

A clone of torch.optim.SGD.

A customized optimizer could be defined like CopyOfSGD. You may derive from built-in optimizers in torch.optim, or directly implement a new optimizer.

class mmaction.core.optimizer.TSMOptimizerConstructor(optimizer_cfg, paramwise_cfg=None)[source]¶

Optimizer constructor in TSM model.

This constructor builds optimizer in different ways from the default one.

Parameters of the first conv layer have default lr and weight decay.
Parameters of BN layers have default lr and zero weight decay.
If the field “fc_lr5” in paramwise_cfg is set to True, the parameters of the last fc layer in cls_head have 5x lr multiplier and 10x weight decay multiplier.
Weights of other layers have default lr and weight decay, and biases have a 2x lr multiplier and zero weight decay.

add_params(params, model)[source]¶

Add parameters and their corresponding lr and wd to the params.

Parameters

params (list) – The list to be modified, containing all parameter groups and their corresponding lr and wd configurations.
model (nn.Module) – The model to be trained with the optimizer.

evaluation¶

class mmaction.core.evaluation.ActivityNetLocalization(ground_truth_filename=None, prediction_filename=None, tiou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]), verbose=False)[source]¶

Class to evaluate detection results on ActivityNet.

Parameters

ground_truth_filename (str | None) – The filename of groundtruth. Default: None.
prediction_filename (str | None) – The filename of action detection results. Default: None.
tiou_thresholds (np.ndarray) – The thresholds of temporal iou to evaluate. Default: np.linspace(0.5, 0.95, 10).
verbose (bool) – Whether to print verbose logs. Default: False.

evaluate()[source]¶

Evaluates a prediction file.

For the detection task, performance is measured by the interpolated mean average precision.

wrapper_compute_average_precision()[source]¶: Computes average precision for each class.

class mmaction.core.evaluation.DistEvalHook(dataloader, start=None, interval=1, by_epoch=True, save_best='auto', rule=None, broadcast_bn_buffer=True, tmpdir=None, gpu_collect=False, **eval_kwargs)[source]¶

Distributed evaluation hook.

This hook regularly performs evaluation at a given interval when running in a distributed environment.

Parameters

dataloader (DataLoader) – A PyTorch dataloader.
start (int | None, optional) – Evaluation starting epoch. It enables evaluation before the training starts if start <= the resuming epoch. If None, whether to evaluate is merely decided by interval. Default: None.
interval (int) – Evaluation interval. Default: 1.
by_epoch (bool) – Whether to perform evaluation by epoch or by iteration. If set to True, evaluation is performed by epoch; otherwise, by iteration. Default: True.
save_best (str | None, optional) – If a metric is specified, it is used to track the best checkpoint during evaluation. Information about the best checkpoint is saved in best.json. Options are the evaluation metrics of the test dataset, e.g., top1_acc, top5_acc, mean_class_accuracy, mean_average_precision, mmit_mean_average_precision for action recognition datasets (RawframeDataset and VideoDataset); AR@AN, auc for action localization datasets (ActivityNetDataset); mAP@0.5IOU for spatio-temporal action detection datasets (AVADataset). If save_best is auto, the first key of the returned OrderedDict result will be used. Default: ‘auto’.
rule (str | None, optional) – Comparison rule for the best score. If set to None, a reasonable rule is inferred: keys such as ‘acc’ or ‘top’ are compared with the ‘greater’ rule, and keys containing ‘loss’ with the ‘less’ rule. Options are ‘greater’, ‘less’, None. Default: None.
tmpdir (str | None) – Temporary directory to save the results of all processes. Default: None.
gpu_collect (bool) – Whether to use gpu or cpu to collect results. Default: False.
broadcast_bn_buffer (bool) – Whether to broadcast the buffer(running_mean and running_var) of rank 0 to other rank before evaluation. Default: True.
**eval_kwargs – Evaluation arguments fed into the evaluate function of the dataset.
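The rule inference described for the rule parameter can be sketched in a few lines. This is only an illustration, not the hook's actual code: infer_rule and is_better are hypothetical helpers, and the key lists simply mirror the examples given above (‘acc’, ‘top’, ‘loss’).

```python
def infer_rule(metric_key, greater_keys=("acc", "top"), less_keys=("loss",)):
    """Guess the comparison rule from the metric name (hypothetical helper)."""
    if any(key in metric_key for key in greater_keys):
        return "greater"
    if any(key in metric_key for key in less_keys):
        return "less"
    raise ValueError(f"Cannot infer a rule for metric {metric_key!r}")


def is_better(new_score, best_score, rule):
    """Compare two scores under the 'greater' or 'less' rule."""
    if rule == "greater":
        return new_score > best_score
    if rule == "less":
        return new_score < best_score
    raise ValueError(f"Unknown rule {rule!r}")


# A higher top-1 accuracy should replace the previous best checkpoint.
rule = infer_rule("top1_acc")
print(rule, is_better(0.82, 0.79, rule))
```

Passing rule explicitly avoids any ambiguity for metric names that match neither list.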

class mmaction.core.evaluation.EvalHook(dataloader, start=None, interval=1, by_epoch=True, save_best='auto', rule=None, **eval_kwargs)[source]¶

Non-Distributed evaluation hook.

Notes

If new arguments are added for EvalHook, tools/test.py and tools/eval_metric.py may be affected.

This hook regularly performs evaluation at a given interval when running in a non-distributed environment.

Parameters

dataloader (DataLoader) – A PyTorch dataloader.
start (int | None, optional) – Evaluation starting epoch. It enables evaluation before the training starts if start <= the resuming epoch. If None, whether to evaluate is merely decided by interval. Default: None.
interval (int) – Evaluation interval. Default: 1.
by_epoch (bool) – Whether to perform evaluation by epoch or by iteration. If set to True, evaluation is performed by epoch; otherwise, by iteration. Default: True.
save_best (str | None, optional) – If a metric is specified, it is used to track the best checkpoint during evaluation. Information about the best checkpoint is saved in best.json. Options are the evaluation metrics of the test dataset, e.g., top1_acc, top5_acc, mean_class_accuracy, mean_average_precision, mmit_mean_average_precision for action recognition datasets (RawframeDataset and VideoDataset); AR@AN, auc for action localization datasets (ActivityNetDataset); mAP@0.5IOU for spatio-temporal action detection datasets (AVADataset). If save_best is auto, the first key of the returned OrderedDict result will be used. Default: ‘auto’.
rule (str | None, optional) – Comparison rule for the best score. If set to None, a reasonable rule is inferred: keys such as ‘acc’ or ‘top’ are compared with the ‘greater’ rule, and keys containing ‘loss’ with the ‘less’ rule. Options are ‘greater’, ‘less’, None. Default: None.
**eval_kwargs – Evaluation arguments fed into the evaluate function of the dataset.

after_train_epoch(runner)[source]¶: Called after every training epoch to evaluate the results.

after_train_iter(runner)[source]¶: Called after every training iter to evaluate the results.

before_train_epoch(runner)[source]¶: Evaluate the model only at the start of training by epoch.

before_train_iter(runner)[source]¶: Evaluate the model only at the start of training by iteration.

evaluate(runner, results)[source]¶

Evaluate the results.

Parameters

runner (mmcv.Runner) – The underlying training runner.
results (list) – Output results.

evaluation_flag(runner)[source]¶

Judge whether to perform evaluation.

Returns: The flag indicating whether to perform evaluation.
Return type: bool

mmaction.core.evaluation.average_precision_at_temporal_iou(ground_truth, prediction, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]¶

Compute average precision (in detection task) between ground truth and predicted data frames. If multiple predictions match the same ground-truth segment, only the one with the highest score is matched as a true positive. This code is greatly inspired by Pascal VOC devkit.

Parameters

ground_truth (dict) – Dict containing the ground truth instances. Key: ‘video_id’ Value (np.ndarray): 1D array of ‘t-start’ and ‘t-end’.
prediction (np.ndarray) – 2D array containing the information of proposal instances, including ‘video_id’, ‘class_id’, ‘t-start’, ‘t-end’ and ‘score’.
temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).

Returns

1D array of average precision score.

Return type

np.ndarray

mmaction.core.evaluation.average_recall_at_avg_proposals(ground_truth, proposals, total_num_proposals, max_avg_proposals=None, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]¶

Computes the average recall given an average number (percentile) of proposals per video.

Parameters

ground_truth (dict) – Dict containing the ground truth instances.
proposals (dict) – Dict containing the proposal instances.
total_num_proposals (int) – Total number of proposals in the proposal dict.
max_avg_proposals (int | None) – Max number of proposals for one video. Default: None.
temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).

Returns

(recall, average_recall, proposals_per_video, auc). In recall, recall[i,j] is the recall at the i-th temporal_iou threshold and the j-th average number (percentile) of proposals per video. The average_recall is the recall averaged over the list of temporal_iou thresholds (1D array); this is equivalent to recall.mean(axis=0). The proposals_per_video is the average number of proposals per video. The auc is the area under the AR@AN curve.

Return type

tuple([np.ndarray, np.ndarray, np.ndarray, float])

mmaction.core.evaluation.confusion_matrix(y_pred, y_real, normalize=None)[source]¶

Compute confusion matrix.

Parameters

y_pred (list[int] | np.ndarray[int]) – Prediction labels.
y_real (list[int] | np.ndarray[int]) – Ground truth labels.
normalize (str | None) – Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized. Options are “true”, “pred”, “all”, None. Default: None.

Returns

Confusion matrix.

Return type

np.ndarray
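The three normalize options are easiest to see on a small case. The sketch below is not the library implementation: confusion_matrix_sketch is a hypothetical helper that uses plain Python lists in place of np.ndarray, with rows indexing the true label and columns the predicted label as described above.

```python
def confusion_matrix_sketch(y_pred, y_real, normalize=None):
    """Build a confusion matrix; rows are true labels, columns predictions."""
    labels = sorted(set(y_pred) | set(y_real))
    index = {label: i for i, label in enumerate(labels)}
    size = len(labels)
    matrix = [[0.0] * size for _ in range(size)]
    for pred, real in zip(y_pred, y_real):
        matrix[index[real]][index[pred]] += 1.0

    if normalize == "true":
        # Divide each row by the number of samples with that true label.
        for row in matrix:
            total = sum(row)
            if total:
                row[:] = [value / total for value in row]
    elif normalize == "pred":
        # Divide each column by the number of samples with that prediction.
        for col in range(size):
            total = sum(row[col] for row in matrix)
            if total:
                for row in matrix:
                    row[col] /= total
    elif normalize == "all":
        # Divide every entry by the total number of samples.
        total = sum(sum(row) for row in matrix)
        if total:
            matrix = [[value / total for value in row] for row in matrix]
    elif normalize is not None:
        raise ValueError(f"Unknown normalize option {normalize!r}")
    return matrix


for row in confusion_matrix_sketch([0, 1, 1, 2], [0, 1, 2, 2], normalize="true"):
    print(row)
```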

mmaction.core.evaluation.get_weighted_score(score_list, coeff_list)[source]¶

Get weighted score with given scores and coefficients.

Given n predictions by different classifier: [score_1, score_2, …, score_n] (score_list) and their coefficients: [coeff_1, coeff_2, …, coeff_n] (coeff_list), return weighted score: weighted_score = score_1 * coeff_1 + score_2 * coeff_2 + … + score_n * coeff_n

Parameters

score_list (list[list[np.ndarray]]) – List of list of scores, with shape n(number of predictions) X num_samples X num_classes
coeff_list (list[float]) – List of coefficients, with shape n.

Returns

List of weighted scores.

Return type

list[np.ndarray]
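As a rough illustration of the formula above, the weighted sum can be written out with nested lists. This is a sketch under the assumption that plain Python lists stand in for the np.ndarray scores; weighted_score_sketch is a hypothetical name, not the library function.

```python
def weighted_score_sketch(score_list, coeff_list):
    """Combine n score sets of shape num_samples x num_classes."""
    if len(score_list) != len(coeff_list):
        raise ValueError("score_list and coeff_list must have the same length")
    num_samples = len(score_list[0])
    combined = []
    for sample in range(num_samples):
        num_classes = len(score_list[0][sample])
        combined.append([
            # weighted_score = score_1 * coeff_1 + ... + score_n * coeff_n
            sum(scores[sample][cls] * coeff
                for scores, coeff in zip(score_list, coeff_list))
            for cls in range(num_classes)
        ])
    return combined


# Two classifiers, two samples, two classes.
rgb_scores = [[1.0, 0.0], [0.5, 0.5]]
flow_scores = [[0.0, 1.0], [1.0, 0.0]]
print(weighted_score_sketch([rgb_scores, flow_scores], [1.0, 0.5]))
```

The variable names rgb_scores and flow_scores are only examples of two prediction sources being fused.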

mmaction.core.evaluation.interpolated_precision_recall(precision, recall)[source]¶

Interpolated AP - VOCdevkit from VOC 2011.

Parameters

precision (np.ndarray) – The precision of different thresholds.
recall (np.ndarray) – The recall of different thresholds.

Returns: Average precision score.
Return type: float
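For readers unfamiliar with the VOC-style interpolation, a minimal sketch follows. It assumes the usual VOC 2011 procedure (pad the curves, make precision monotonically non-increasing from right to left, then sum the area over recall steps); interpolated_ap_sketch is a hypothetical name and plain lists stand in for np.ndarray.

```python
def interpolated_ap_sketch(precision, recall):
    """Area under the interpolated precision-recall curve."""
    # Pad so the curve starts at recall 0 and ends at recall 1.
    mprecision = [0.0] + list(precision) + [0.0]
    mrecall = [0.0] + list(recall) + [1.0]

    # Replace each precision by the maximum precision to its right.
    for i in range(len(mprecision) - 2, -1, -1):
        mprecision[i] = max(mprecision[i], mprecision[i + 1])

    # Sum rectangle areas wherever recall changes.
    ap = 0.0
    for i in range(1, len(mrecall)):
        if mrecall[i] != mrecall[i - 1]:
            ap += (mrecall[i] - mrecall[i - 1]) * mprecision[i]
    return ap


print(interpolated_ap_sketch([1.0, 0.5, 2.0 / 3.0], [0.5, 0.5, 1.0]))
```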

mmaction.core.evaluation.mean_average_precision(scores, labels)[source]¶

Mean average precision for multi-label recognition.

Parameters

scores (list[np.ndarray]) – Prediction scores of different classes for each sample.
labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.

Returns

The mean average precision.

Return type

np.float

mmaction.core.evaluation.mean_class_accuracy(scores, labels)[source]¶

Calculate mean class accuracy.

Parameters

scores (list[np.ndarray]) – Prediction scores for each class.
labels (list[int]) – Ground truth labels.

Returns

Mean class accuracy.

Return type

np.ndarray
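Mean class accuracy averages per-class accuracies rather than per-sample hits, so rare classes weigh as much as frequent ones. The sketch below illustrates the idea with plain lists; mean_class_accuracy_sketch is a hypothetical helper, and averaging only over classes that occur in labels is an assumption of this sketch rather than a statement about the library.

```python
def mean_class_accuracy_sketch(scores, labels):
    """Average the per-class accuracy over classes present in labels."""
    hits, counts = {}, {}
    for sample_scores, label in zip(scores, labels):
        # The predicted class is the index of the highest score.
        pred = max(range(len(sample_scores)), key=lambda c: sample_scores[c])
        counts[label] = counts.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + int(pred == label)
    return sum(hits[cls] / counts[cls] for cls in counts) / len(counts)


scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
labels = [0, 1, 1]
print(mean_class_accuracy_sketch(scores, labels))
```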

mmaction.core.evaluation.mmit_mean_average_precision(scores, labels)[source]¶

Mean average precision for multi-label recognition. Used for reporting MMIT style mAP on Multi-Moments in Time. The difference is that this method calculates average precision for each sample and averages them among samples.

Parameters

scores (list[np.ndarray]) – Prediction scores of different classes for each sample.
labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.

Returns

The MMIT style mean average precision.

Return type

np.float

mmaction.core.evaluation.pairwise_temporal_iou(candidate_segments, target_segments, calculate_overlap_self=False)[source]¶

Compute intersection over union between segments.

Parameters

candidate_segments (np.ndarray) – 1-dim/2-dim array in format [init, end]/[m x 2:=[init, end]].
target_segments (np.ndarray) – 2-dim array in format [n x 2:=[init, end]].
calculate_overlap_self (bool) – Whether to calculate overlap_self (intersection / candidate_length) or not. Default: False.

Returns

t_iou (np.ndarray): 1-dim array [n] / 2-dim array [n x m] with IoU ratio.
t_overlap_self (np.ndarray, optional): 1-dim array [n] / 2-dim array [n x m] with overlap_self, returned when calculate_overlap_self is True.

mmaction.core.evaluation.softmax(x, dim=1)[source]¶: Compute softmax values for each set of scores in x.

mmaction.core.evaluation.top_k_accuracy(scores, labels, topk=(1))[source]¶

Calculate top k accuracy score.

Parameters

scores (list[np.ndarray]) – Prediction scores for each class.
labels (list[int]) – Ground truth labels.
topk (tuple[int]) – K value for top_k_accuracy. Default: (1, ).

Returns

Top k accuracy score for each k.

Return type

list[float]
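A small sketch may help make the meaning of topk concrete: a sample counts as correct for a given k when its ground-truth label is among the k highest-scoring classes. The helper name top_k_accuracy_sketch is hypothetical and plain lists replace np.ndarray.

```python
def top_k_accuracy_sketch(scores, labels, topk=(1,)):
    """Return one accuracy value per k in topk."""
    results = []
    for k in topk:
        hits = 0
        for sample_scores, label in zip(scores, labels):
            # Class indices sorted by descending score.
            ranked = sorted(range(len(sample_scores)),
                            key=lambda c: sample_scores[c], reverse=True)
            if label in ranked[:k]:
                hits += 1
        results.append(hits / len(labels))
    return results


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy_sketch(scores, labels, topk=(1, 2)))
```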

lr¶

class mmaction.core.lr.TINLrUpdaterHook(min_lr, **kwargs)[source]¶

mmaction.localization¶

localization¶

mmaction.localization.eval_ap(detections, gt_by_cls, iou_range)[source]¶

Evaluate average precisions.

Parameters

detections (dict) – Results of detections.
gt_by_cls (dict) – Information of groundtruth.
iou_range (list) – Ranges of iou.

Returns

Average precision values of classes at ious.

Return type

list

mmaction.localization.generate_bsp_feature(video_list, video_infos, tem_results_dir, pgm_proposals_dir, top_k=1000, bsp_boundary_ratio=0.2, num_sample_start=8, num_sample_end=8, num_sample_action=16, num_sample_interp=3, tem_results_ext='.csv', pgm_proposal_ext='.csv', result_dict=None)[source]¶

Generate Boundary-Sensitive Proposal Feature with given proposals.

Parameters

video_list (list[int]) – List of video indices to generate bsp_feature.
video_infos (list[dict]) – List of video_info dict that contains ‘video_name’.
tem_results_dir (str) – Directory to load temporal evaluation results.
pgm_proposals_dir (str) – Directory to load proposals.
top_k (int) – Number of proposals to be considered. Default: 1000
bsp_boundary_ratio (float) – Ratio for proposal boundary (start/end). Default: 0.2.
num_sample_start (int) – Num of samples for actionness in start region. Default: 8.
num_sample_end (int) – Num of samples for actionness in end region. Default: 8.
num_sample_action (int) – Num of samples for actionness in center region. Default: 16.
num_sample_interp (int) – Num of samples for interpolation for each sample point. Default: 3.
tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.
pgm_proposal_ext (str) – File extension for proposals. Default: ‘.csv’.
result_dict (dict | None) – The dict to save the results. Default: None.

Returns

A dict containing video_name as keys and bsp_feature as values. If result_dict is not None, the results are saved to it.

Return type

dict

mmaction.localization.generate_candidate_proposals(video_list, video_infos, tem_results_dir, temporal_scale, peak_threshold, tem_results_ext='.csv', result_dict=None)[source]¶

Generate Candidate Proposals with given temporal evaluation results. Each proposal file will contain: ‘tmin,tmax,tmin_score,tmax_score,score,match_iou,match_ioa’.

Parameters

video_list (list[int]) – List of video indices to generate proposals.
video_infos (list[dict]) – List of video_info dict that contains ‘video_name’, ‘duration_frame’, ‘duration_second’, ‘feature_frame’, and ‘annotations’.
tem_results_dir (str) – Directory to load temporal evaluation results.
temporal_scale (int) – The number (scale) on temporal axis.
peak_threshold (float) – The threshold for proposal generation.
tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.
result_dict (dict | None) – The dict to save the results. Default: None.

Returns

A dict containing video_name as keys and proposal lists as values. If result_dict is not None, the results are saved to it.

Return type

dict

mmaction.localization.load_localize_proposal_file(filename)[source]¶

Load the proposal file and split it into many parts which contain one video’s information separately.

Parameters: filename (str) – Path to the proposal file.
Returns: List of all videos’ information.
Return type: list

mmaction.localization.perform_regression(detections)[source]¶

Perform regression on detection results.

Parameters: detections (list) – Detection results before regression.
Returns: Detection results after regression.
Return type: list

mmaction.localization.soft_nms(proposals, alpha, low_threshold, high_threshold, top_k)[source]¶

Soft NMS for temporal proposals.

Parameters

proposals (np.ndarray) – Proposals generated by network.
alpha (float) – Alpha value of Gaussian decaying function.
low_threshold (float) – Low threshold for soft nms.
high_threshold (float) – High threshold for soft nms.
top_k (int) – Top k values to be considered.

Returns

The updated proposals.

Return type

np.ndarray

mmaction.localization.temporal_iop(proposal_min, proposal_max, gt_min, gt_max)[source]¶

Compute IoP score between a groundtruth bbox and the proposals.

Compute the IoP which is defined as the overlap ratio with groundtruth proportional to the duration of this proposal.

Parameters

proposal_min (list[float]) – List of temporal anchor min.
proposal_max (list[float]) – List of temporal anchor max.
gt_min (float) – Groundtruth temporal box min.
gt_max (float) – Groundtruth temporal box max.

Returns

List of intersection over anchor scores.

Return type

list[float]
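The IoP definition above (intersection divided by the proposal's own duration) can be illustrated as follows. This is a sketch with plain Python lists rather than the library code, and temporal_iop_sketch is a hypothetical name.

```python
def temporal_iop_sketch(proposal_min, proposal_max, gt_min, gt_max):
    """Intersection over proposal length for each proposal."""
    scores = []
    for p_min, p_max in zip(proposal_min, proposal_max):
        # Overlap between the proposal and the groundtruth, clipped at zero.
        inter = max(min(p_max, gt_max) - max(p_min, gt_min), 0.0)
        scores.append(inter / (p_max - p_min))
    return scores


# One proposal half covered by the groundtruth, one disjoint from it.
print(temporal_iop_sketch([0.0, 2.0], [1.0, 3.0], gt_min=0.5, gt_max=1.0))
```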

mmaction.localization.temporal_iou(proposal_min, proposal_max, gt_min, gt_max)[source]¶

Compute IoU score between a groundtruth bbox and the proposals.

Parameters

proposal_min (list[float]) – List of temporal anchor min.
proposal_max (list[float]) – List of temporal anchor max.
gt_min (float) – Groundtruth temporal box min.
gt_max (float) – Groundtruth temporal box max.

Returns

List of iou scores.

Return type

list[float]
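For comparison with IoP, the IoU score divides the same intersection by the union of the proposal and the groundtruth segment. The following is a minimal sketch with plain lists; temporal_iou_sketch is a hypothetical name, not the library function.

```python
def temporal_iou_sketch(proposal_min, proposal_max, gt_min, gt_max):
    """Intersection over union for each proposal against one groundtruth."""
    scores = []
    for p_min, p_max in zip(proposal_min, proposal_max):
        inter = max(min(p_max, gt_max) - max(p_min, gt_min), 0.0)
        # Union = both lengths minus the part counted twice.
        union = (p_max - p_min) + (gt_max - gt_min) - inter
        scores.append(inter / union)
    return scores


print(temporal_iou_sketch([0.0, 0.0, 2.0], [2.0, 1.0, 3.0], gt_min=0.0, gt_max=1.0))
```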

mmaction.localization.temporal_nms(detections, threshold)[source]¶

Perform temporal non-maximum suppression (NMS) on detection results.

Parameters

detections (list) – Detection results before NMS.
threshold (float) – Threshold of NMS.

Returns

Detection results after NMS.

Return type

list

mmaction.models¶

models¶

class mmaction.models.AudioRecognizer(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶

Audio recognizer model framework.

forward(audios, label=None, return_loss=True)[source]¶: Define the computation performed at every call.

forward_gradcam(audios)[source]¶: Defines the computation performed at every call when using gradcam utils.

forward_test(audios)[source]¶: Defines the computation performed at every call during evaluation and testing.

forward_train(audios, labels)[source]¶: Defines the computation performed at every call when training.

train_step(data_batch, optimizer, **kwargs)[source]¶

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters

data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict

val_step(data_batch, optimizer, **kwargs)[source]¶

The iteration step during validation.

This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.

class mmaction.models.AudioTSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶

Classification head for TSN on audio.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The classification scores for input samples.
Return type: torch.Tensor

init_weights()[source]¶: Initialize the parameters from scratch.

class mmaction.models.BBoxHeadAVA(temporal_pool_type='avg', spatial_pool_type='max', in_channels=2048, num_classes=81, dropout_ratio=0, dropout_before_pool=True, topk=(3, 5), multilabel=True)[source]¶

Simplest RoI head, with only two fc layers for classification and regression respectively.

Parameters

temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
in_channels (int) – The number of input channels. Default: 2048.
num_classes (int) – The number of classes. Default: 81.
dropout_ratio (float) – A float in [0, 1], indicates the dropout_ratio. Default: 0.
dropout_before_pool (bool) – Dropout Feature before spatial temporal pooling. Default: True.
topk (int or tuple[int]) – Parameter for evaluating multilabel accuracy. Default: (3, 5)
multilabel (bool) – Whether used for a multilabel task. Default: True. (Only multilabel == True is supported for now.)

forward(x)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

recall_prec(pred_vec, target_vec)[source]¶

Parameters

pred_vec (tensor[N x C]) – each element is either 0 or 1
target_vec (tensor[N x C]) – each element is either 0 or 1

class mmaction.models.BCELossWithLogits(loss_weight=1.0, class_weight=None)[source]¶

Binary Cross Entropy Loss with logits.

Parameters

loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.

class mmaction.models.BMN(temporal_dim, boundary_ratio, num_samples, num_samples_per_bin, feat_dim, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, loss_cls={'type': 'BMNLoss'}, hidden_dim_1d=256, hidden_dim_2d=128, hidden_dim_3d=512)[source]¶

Boundary Matching Network for temporal action proposal generation.

Please refer to BMN: Boundary-Matching Network for Temporal Action Proposal Generation. Code reference: https://github.com/JJBOY/BMN-Boundary-Matching-Network

Parameters

temporal_dim (int) – Total frames selected for each video.
boundary_ratio (float) – Ratio for determining video boundaries.
num_samples (int) – Number of samples for each proposal.
num_samples_per_bin (int) – Number of bin samples for each sample.
feat_dim (int) – Feature dimension.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
loss_cls (dict) – Config for building loss. Default: dict(type='BMNLoss').
hidden_dim_1d (int) – Hidden dim for 1d conv. Default: 256.
hidden_dim_2d (int) – Hidden dim for 2d conv. Default: 128.
hidden_dim_3d (int) – Hidden dim for 3d conv. Default: 512.

forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶: Define the computation performed at every call.

forward_test(raw_feature, video_meta)[source]¶: Define the computation performed at every call when testing.

forward_train(raw_feature, label_confidence, label_start, label_end)[source]¶: Define the computation performed at every call when training.

generate_labels(gt_bbox)[source]¶: Generate training labels.

class mmaction.models.BMNLoss[source]¶

BMN Loss.

From paper https://arxiv.org/abs/1907.09702, code https://github.com/JJBOY/BMN-Boundary-Matching-Network. It calculates the loss for the BMN model. This loss is a weighted sum of:

1) temporal evaluation loss based on confidence score of start and end positions.
2) proposal evaluation regression loss based on confidence scores of candidate proposals.
3) proposal evaluation classification loss based on classification results of candidate proposals.

forward(pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, bm_mask, weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0)[source]¶

Calculate Boundary Matching Network Loss.

Parameters

pred_bm (torch.Tensor) – Predicted confidence score for boundary matching map.
pred_start (torch.Tensor) – Predicted confidence score for start.
pred_end (torch.Tensor) – Predicted confidence score for end.
gt_iou_map (torch.Tensor) – Groundtruth score for boundary matching map.
gt_start (torch.Tensor) – Groundtruth temporal_iou score for start.
gt_end (torch.Tensor) – Groundtruth temporal_iou score for end.
bm_mask (torch.Tensor) – Boundary-Matching mask.
weight_tem (float) – Weight for tem loss. Default: 1.0.
weight_pem_reg (float) – Weight for pem regression loss. Default: 10.0.
weight_pem_cls (float) – Weight for pem classification loss. Default: 1.0.

Returns

(loss, tem_loss, pem_reg_loss, pem_cls_loss). Loss is the bmn loss, tem_loss is the temporal evaluation loss, pem_reg_loss is the proposal evaluation regression loss, pem_cls_loss is the proposal evaluation classification loss.

Return type

tuple([torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor])
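The weighted sum that forms the total loss can be sketched with scalars. This is an illustration only: combine_bmn_losses is a hypothetical helper, plain floats stand in for torch.Tensor values, and the default weights are the ones listed in the signature above.

```python
def combine_bmn_losses(tem_loss, pem_reg_loss, pem_cls_loss,
                       weight_tem=1.0, weight_pem_reg=10.0,
                       weight_pem_cls=1.0):
    """Weighted sum of the three BMN loss terms."""
    return (weight_tem * tem_loss
            + weight_pem_reg * pem_reg_loss
            + weight_pem_cls * pem_cls_loss)


# With the default weights the regression term is scaled by 10.
print(combine_bmn_losses(tem_loss=0.5, pem_reg_loss=0.1, pem_cls_loss=0.25))
```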

static pem_cls_loss(pred_score, gt_iou_map, mask, threshold=0.9, ratio_range=(1.05, 21), eps=1e-05)[source]¶

Calculate Proposal Evaluation Module Classification Loss.

Parameters

pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
threshold (float) – Threshold of temporal_iou for positive instances. Default: 0.9.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)
eps (float) – Epsilon for small value. Default: 1e-5

Returns

Proposal evaluation classification loss.

Return type

torch.Tensor

static pem_reg_loss(pred_score, gt_iou_map, mask, high_temporal_iou_threshold=0.7, low_temporal_iou_threshold=0.3)[source]¶

Calculate Proposal Evaluation Module Regression Loss.

Parameters

pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
high_temporal_iou_threshold (float) – Higher threshold of temporal_iou. Default: 0.7.
low_temporal_iou_threshold (float) – Lower threshold of temporal_iou. Default: 0.3.

Returns

Proposal evaluation regression loss.

Return type

torch.Tensor

static tem_loss(pred_start, pred_end, gt_start, gt_end)[source]¶

Calculate Temporal Evaluation Module Loss.

This function calculates the binary_logistic_regression_loss for start and end respectively and returns the sum of their losses.

Parameters

pred_start (torch.Tensor) – Predicted start score by BMN model.
pred_end (torch.Tensor) – Predicted end score by BMN model.
gt_start (torch.Tensor) – Groundtruth confidence score for start.
gt_end (torch.Tensor) – Groundtruth confidence score for end.

Returns

Returned binary logistic loss.

Return type

torch.Tensor

class mmaction.models.BaseHead(num_classes, in_channels, loss_cls={'loss_weight': 1.0, 'type': 'CrossEntropyLoss'}, multi_class=False, label_smooth_eps=0.0)[source]¶

Base class for head.

All heads should subclass it. All subclasses should override:

Methods:init_weights, initializing weights in some modules.
Methods:forward, supporting the forward pass for both training and testing.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’, loss_weight=1.0).
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: arxiv.org/abs/1906.02629. Default: 0.
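
For readers unfamiliar with label smoothing, the following is a minimal sketch of the usual formulation, in which a one-hot label is blended with a uniform distribution over the classes. The helper name smooth_labels is hypothetical, and whether the head applies exactly this formula is an assumption.

```python
def smooth_labels(one_hot, eps):
    """Blend a one-hot label with a uniform distribution (hypothetical helper)."""
    num_classes = len(one_hot)
    return [(1.0 - eps) * value + eps / num_classes for value in one_hot]


# With eps=0.1 and 4 classes, the true class keeps most of the mass
# and every class receives an extra eps / num_classes.
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], eps=0.1)
print(smoothed)
```

Setting eps to 0 leaves the label unchanged, which is consistent with the default above.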

abstract forward(x)[source]¶: Defines the computation performed at every call.

abstract init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

loss(cls_score, labels, **kwargs)[source]¶

Calculate the loss given output cls_score, target labels.

Parameters

cls_score (torch.Tensor) – The output of the model.
labels (torch.Tensor) – The target output of the model.

Returns

A dict containing field ‘loss_cls’(mandatory) and ‘top1_acc’, ‘top5_acc’(optional).

Return type

dict

class mmaction.models.BaseRecognizer(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶

Base class for recognizers.

All recognizers should subclass it. All subclasses should override:

Methods:forward_train, supporting the forward pass when training.
Methods:forward_test, supporting the forward pass when testing.

Parameters

backbone (dict) – Backbone modules to extract feature.
cls_head (dict) – Classification head to process feature.
train_cfg (dict | None) – Config for training. Default: None.
test_cfg (dict | None) – Config for testing. Default: None.

average_clip(cls_score, num_segs=1)[source]¶

Averaging class score over multiple clips.

Uses the averaging type defined in test_cfg (‘score’, ‘prob’ or None) to compute the final averaged class score. Only called in test mode.

Parameters

cls_score (torch.Tensor) – Class score to be averaged.
num_segs (int) – Number of clips for each input sample.

Returns

Averaged class score.

Return type

torch.Tensor
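
The difference between the averaging types can be sketched in plain Python. This example assumes that ‘score’ averages the raw class scores across clips, ‘prob’ applies a softmax to each clip before averaging, and None returns the scores untouched. The function names and the list-based inputs are illustrative assumptions, not the recognizer's actual code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one list of class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_clips(clip_scores, average_type="score"):
    """Average per-clip class scores for one sample (hypothetical helper)."""
    if average_type not in ("score", "prob", None):
        raise ValueError(f"unsupported average type: {average_type}")
    if average_type is None:
        return clip_scores
    if average_type == "prob":
        clip_scores = [softmax(scores) for scores in clip_scores]
    num_clips = len(clip_scores)
    # zip(*...) groups the scores of each class across all clips.
    return [sum(per_class) / num_clips for per_class in zip(*clip_scores)]


clips = [[2.0, 0.0], [0.0, 1.0]]
by_score = average_clips(clips, "score")
by_prob = average_clips(clips, "prob")
print(by_score, by_prob)
```

Averaging probabilities keeps the result a valid distribution, while averaging raw scores preserves their original scale.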

extract_feat(imgs)[source]¶

Extract features through a backbone.

Parameters: imgs (torch.Tensor) – The input images.
Returns: The extracted features.
Return type: torch.Tensor

forward(imgs, label=None, return_loss=True, **kwargs)[source]¶: Define the computation performed at every call.

abstract forward_gradcam(imgs)[source]¶: Defines the computation performed at every call when using gradcam utils.

abstract forward_test(imgs)[source]¶: Defines the computation performed at every call during evaluation and testing.

abstract forward_train(imgs, labels, **kwargs)[source]¶: Defines the computation performed at every call when training.

init_weights()[source]¶: Initialize the model network weights.

train_step(data_batch, optimizer, **kwargs)[source]¶

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters

data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

A dict that should contain at least 3 keys: loss, log_vars and num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict
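
To make the role of the three keys concrete, here is a small sketch that builds step outputs with plain floats in place of tensors and averages a logged variable weighted by num_samples. Both helper names are hypothetical, and the weighting scheme is an assumption about how a logger might use num_samples.

```python
def make_step_output(loss, log_vars, num_samples):
    """Bundle the three required keys (hypothetical helper, floats instead of tensors)."""
    return {"loss": loss, "log_vars": log_vars, "num_samples": num_samples}


def weighted_log_average(outputs, key):
    """Average one logged variable over steps, weighting each step by its batch size."""
    total_samples = sum(out["num_samples"] for out in outputs)
    weighted_sum = sum(out["log_vars"][key] * out["num_samples"] for out in outputs)
    return weighted_sum / total_samples


steps = [
    make_step_output(0.9, {"loss_cls": 0.9, "top1_acc": 0.50}, num_samples=8),
    make_step_output(0.5, {"loss_cls": 0.5, "top1_acc": 0.75}, num_samples=4),
]
mean_acc = weighted_log_average(steps, "top1_acc")
print(mean_acc)
```

Weighting by num_samples keeps a smaller final batch from counting as much as a full one.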

val_step(data_batch, optimizer, **kwargs)[source]¶

The iteration step during validation.

This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.

property with_neck¶

Whether the recognizer has a neck.

Type: bool

class mmaction.models.BinaryLogisticRegressionLoss[source]¶

Binary Logistic Regression Loss.

It will calculate binary logistic regression loss given reg_score and label.

forward(reg_score, label, threshold=0.5, ratio_range=(1.05, 21), eps=1e-05)[source]¶

Calculate Binary Logistic Regression Loss.

Parameters

reg_score (torch.Tensor) – Predicted score by model.
label (torch.Tensor) – Groundtruth labels.
threshold (float) – Threshold for positive instances. Default: 0.5.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21).
eps (float) – Epsilon for small value. Default: 1e-5.

Returns

Returned binary logistic loss.

Return type

torch.Tensor
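
The parameters above suggest a class-balanced logistic loss. The sketch below shows one way such a loss can be written in plain Python: labels above threshold count as positives, the ratio of all entries to positives is clipped to ratio_range, and the clipped ratio sets the weights of the positive and negative terms. The exact weighting is an assumption for illustration and may differ from the library's code.

```python
import math


def binary_logistic_regression_loss(reg_score, label, threshold=0.5,
                                    ratio_range=(1.05, 21), eps=1e-5):
    """Class-balanced binary logistic loss over flat lists (illustrative sketch)."""
    positives = [1.0 if value > threshold else 0.0 for value in label]
    num_positive = max(sum(positives), 1.0)
    # Ratio of all entries to positive entries, clipped to ratio_range.
    ratio = len(label) / num_positive
    ratio = min(max(ratio, ratio_range[0]), ratio_range[1])
    coef_neg = 0.5 * ratio / (ratio - 1.0)
    coef_pos = 0.5 * ratio
    total = 0.0
    for score, is_pos in zip(reg_score, positives):
        total += coef_pos * is_pos * math.log(score + eps)
        total += coef_neg * (1.0 - is_pos) * math.log(1.0 - score + eps)
    return -total / len(label)


labels = [1.0, 0.0, 0.0, 1.0]
good = binary_logistic_regression_loss([0.9, 0.1, 0.2, 0.8], labels)
bad = binary_logistic_regression_loss([0.1, 0.9, 0.8, 0.2], labels)
print(good, bad)
```

Because the lower bound of ratio_range is above 1, the denominator ratio - 1 stays positive, and eps keeps the logarithms finite when a score reaches 0 or 1.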

class mmaction.models.C3D(pretrained=None, style='pytorch', conv_cfg=None, norm_cfg=None, act_cfg=None, dropout_ratio=0.5, init_std=0.005)[source]¶

C3D backbone.

Parameters

pretrained (str | None) – Name of pretrained model.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
conv_cfg (dict | None) – Config dict for convolution layer. If set to None, it uses dict(type='Conv3d') to construct layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. The required key is type. Default: None.
act_cfg (dict | None) – Config dict for activation layer. If set to None, it uses dict(type='ReLU') to construct layers. Default: None.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization of fc layers. Default: 0.005.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data. The size of x is (num_batches, 3, 16, 112, 112).
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.Conv2plus1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, norm_cfg={'type': 'BN3d'})[source]¶

(2+1)d Conv module for R(2+1)d backbone.

https://arxiv.org/pdf/1711.11248.pdf.

Parameters

in_channels (int) – Same as nn.Conv3d.
out_channels (int) – Same as nn.Conv3d.
kernel_size (int | tuple[int]) – Same as nn.Conv3d.
stride (int | tuple[int]) – Same as nn.Conv3d.
padding (int | tuple[int]) – Same as nn.Conv3d.
dilation (int | tuple[int]) – Same as nn.Conv3d.
groups (int) – Same as nn.Conv3d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The output of the module.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.
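
A (2+1)d convolution replaces one 3D convolution with a spatial convolution followed by a temporal one. The paper linked above chooses the number of intermediate channels so that the factorized pair has roughly as many parameters as the full 3D convolution. The sketch below evaluates that formula; the function names are hypothetical, and whether this module uses exactly this rule is an assumption.

```python
def mid_channels_2plus1d(in_channels, out_channels, kernel_size):
    """Intermediate channel count from the R(2+1)D paper's parameter-matching rule."""
    t, d_h, d_w = kernel_size
    numerator = t * d_h * d_w * in_channels * out_channels
    denominator = d_h * d_w * in_channels + t * out_channels
    return numerator // denominator


def conv_params(in_channels, out_channels, kernel_size):
    """Weight count of a bias-free convolution with the given kernel."""
    t, d_h, d_w = kernel_size
    return t * d_h * d_w * in_channels * out_channels


mid = mid_channels_2plus1d(64, 64, (3, 3, 3))
full_3d = conv_params(64, 64, (3, 3, 3))
# Spatial 1x3x3 conv into `mid` channels, then temporal 3x1x1 conv out of them.
factorized = conv_params(64, mid, (1, 3, 3)) + conv_params(mid, 64, (3, 1, 1))
print(mid, full_3d, factorized)
```

For this 64-to-64 case the two parameter counts match exactly; for other channel sizes the floor division makes the factorized count slightly smaller.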

class mmaction.models.ConvAudio(in_channels, out_channels, kernel_size, op='concat', stride=1, padding=0, dilation=1, groups=1, bias=False)[source]¶

Conv2d module for AudioResNet backbone.

https://arxiv.org/abs/2001.08740.

Parameters

in_channels (int) – Same as nn.Conv2d.
out_channels (int) – Same as nn.Conv2d.
kernel_size (int | tuple[int]) – Same as nn.Conv2d.
op (string) – Operation to merge the output of freq and time feature map. Choices are ‘sum’ and ‘concat’. Default: ‘concat’.
stride (int | tuple[int]) – Same as nn.Conv2d.
padding (int | tuple[int]) – Same as nn.Conv2d.
dilation (int | tuple[int]) – Same as nn.Conv2d.
groups (int) – Same as nn.Conv2d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The output of the module.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.CrossEntropyLoss(loss_weight=1.0, class_weight=None)[source]¶

Cross Entropy Loss.

Support two kinds of labels and their corresponding loss type. The loss type is detected from the shapes of cls_score and label.

1) Hard label: This label is an integer array and all of the elements are in the range [0, num_classes - 1]. This label’s shape should be cls_score’s shape with the num_classes dimension removed.
2) Soft label (probability distribution over classes): This label is a probability distribution and all of the elements are in the range [0, 1]. This label’s shape must be the same as cls_score. For now, only 2-dim soft label is supported.

Parameters

loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
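
The two label kinds can be illustrated with a small plain-Python cross entropy. In this sketch a hard label is an integer class index per sample and a soft label is a full row of probabilities per sample. Dispatching on the element type rather than on tensor shapes is a simplification made for the example, and the function names are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one row of class scores."""
    shift = max(scores)
    log_total = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return [s - log_total for s in scores]


def cross_entropy(cls_score, label):
    """Mean cross entropy for hard (int) or soft (row of probabilities) labels."""
    losses = []
    for scores, target in zip(cls_score, label):
        log_probs = log_softmax(scores)
        if isinstance(target, int):
            # Hard label: pick the log-probability of the target class.
            losses.append(-log_probs[target])
        else:
            # Soft label: must have one probability per class.
            assert len(target) == len(scores)
            losses.append(-sum(p * lp for p, lp in zip(target, log_probs)))
    return sum(losses) / len(losses)


scores = [[2.0, 1.0, 0.1]]
hard_loss = cross_entropy(scores, [0])
soft_loss = cross_entropy(scores, [[1.0, 0.0, 0.0]])
print(hard_loss, soft_loss)
```

A soft label that places all its mass on one class reduces to the hard-label case, which is why the two losses above agree.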

class mmaction.models.FBOHead(lfb_cfg, fbo_cfg, temporal_pool_type='avg', spatial_pool_type='max')[source]¶

Feature Bank Operator Head.

Add feature bank operator for the spatiotemporal detection model to fuse short-term features and long-term features.

Parameters

lfb_cfg (Dict) – The config dict for LFB which is used to sample long-term features.
fbo_cfg (Dict) – The config dict for feature bank operator (FBO). The type of FBO is also specified in the config dict, and the supported FBO types are those in fbo_dict.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.

forward(x, rois, img_metas)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights(pretrained=None)[source]¶

Initialize the weights in the module.

Parameters: pretrained (str, optional) – Path to pre-trained weights. Default: None.

sample_lfb(rois, img_metas)[source]¶: Sample long-term features for each ROI feature.

class mmaction.models.HVULoss(categories=('action', 'attribute', 'concept', 'event', 'object', 'scene'), category_nums=(739, 117, 291, 69, 1678, 248), category_loss_weights=(1, 1, 1, 1, 1, 1), loss_type='all', with_mask=False, reduction='mean', loss_weight=1.0)[source]¶

Calculate the BCELoss for HVU.

Parameters

categories (tuple[str]) – Names of tag categories, tags are organized in this order. Default: (‘action’, ‘attribute’, ‘concept’, ‘event’, ‘object’, ‘scene’).
category_nums (tuple[int]) – Number of tags for each category. Default: (739, 117, 291, 69, 1678, 248).
category_loss_weights (tuple[float]) – Loss weights of categories. It applies only if loss_type == ‘individual’. The loss weights are normalized so that they sum to 1, so any positive numbers can be given as loss weights. Default: (1, 1, 1, 1, 1, 1).
loss_type (str) – The loss type we calculate, we can either calculate the BCELoss for all tags, or calculate the BCELoss for tags in each category. Choices are ‘individual’ or ‘all’. Default: ‘all’.
with_mask (bool) – Some tag categories are missing for some video clips. If with_mask == True, no loss is calculated for these missing categories. Otherwise, the missing categories are treated as negative samples. Default: False.
reduction (str) – Reduction way. Choices are ‘mean’ or ‘sum’. Default: ‘mean’.
loss_weight (float) – The loss weight. Default: 1.0.
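
Two bookkeeping steps implied by the parameters above can be sketched directly: locating each category's span in the concatenated tag vector from category_nums, and normalizing category_loss_weights so they sum to 1. The helper names are hypothetical, and laying the categories out back to back in the listed order is an assumption based on the description of categories.

```python
categories = ("action", "attribute", "concept", "event", "object", "scene")
category_nums = (739, 117, 291, 69, 1678, 248)


def category_slices(nums):
    """Return (start, end) index pairs for categories laid out back to back."""
    slices, start = [], 0
    for num in nums:
        slices.append((start, start + num))
        start += num
    return slices


def normalize_weights(weights):
    """Scale positive weights so that they sum to 1."""
    total = sum(weights)
    return [weight / total for weight in weights]


spans = dict(zip(categories, category_slices(category_nums)))
weights = normalize_weights((2, 1, 1, 1, 1, 2))
print(spans["object"], weights)
```

With the default category_nums the concatenated tag vector has 3142 entries in total.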

class mmaction.models.I3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, **kwargs)[source]¶

Classification head for I3D.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The classification scores for input samples.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.LFB(lfb_prefix_path, max_num_sampled_feat=5, window_size=60, lfb_channels=2048, dataset_modes=('train', 'val'), device='gpu', lmdb_map_size=4000000000.0, construct_lmdb=True)[source]¶

Long-Term Feature Bank (LFB).

LFB is proposed in Long-Term Feature Banks for Detailed Video Understanding.

The ROI features of videos are stored in the feature bank. The feature bank is generated by running inference with an LFB infer config.

Formally, LFB is a dict whose keys are video IDs and whose values are also dicts, keyed by timestamps in seconds.
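
The nested structure described above can be pictured with a toy bank. The video IDs, timestamps and two-element feature vectors below are invented for illustration (real features would have lfb_channels entries each), and the windowed lookup is a hypothetical helper rather than the class's own sampling logic.

```python
# Toy bank: video ID -> timestamp in seconds -> list of ROI feature vectors.
toy_lfb = {
    "video_a": {
        901: [[0.1, 0.2], [0.3, 0.4]],  # two ROI features at second 901
        902: [[0.5, 0.6]],
    },
    "video_b": {15: [[0.7, 0.8]]},
}


def features_in_window(bank, video_id, center_sec, window_size):
    """Collect features whose timestamps fall in a window centred on center_sec."""
    half = window_size // 2
    per_second = bank.get(video_id, {})
    collected = []
    for sec in range(center_sec - half, center_sec + half + 1):
        collected.extend(per_second.get(sec, []))
    return collected


nearby = features_in_window(toy_lfb, "video_a", center_sec=901, window_size=2)
print(len(nearby))
```

Keying by video ID and then by second makes it cheap to gather the long-term context around any given timestamp.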

Parameters

lfb_prefix_path (str) – The storage path of lfb.
max_num_sampled_feat (int) – The max number of sampled features. Default: 5.
window_size (int) – Window size of sampling long term feature. Default: 60.
lfb_channels (int) – Number of the channels of the features stored in LFB. Default: 2048.
dataset_modes (tuple[str] | str) – Load LFB of datasets with different modes, such as training, validation, testing datasets. If you don’t do cross validation during training, just load the training dataset i.e. setting dataset_modes = (‘train’). Default: (‘train’, ‘val’).
device (str) – Where to load lfb. Choices are ‘gpu’, ‘cpu’ and ‘lmdb’. A 1.65GB half-precision ava lfb (including training and validation) occupies about 2GB GPU memory. Default: ‘gpu’.
lmdb_map_size (int) – Map size of lmdb. Default: 4e9.
construct_lmdb (bool) – Whether to construct lmdb. If you have constructed lmdb of lfb, you can set to False to skip the construction. Default: True.

class mmaction.models.LFBInferHead(lfb_prefix_path, dataset_mode='train', use_half_precision=True, temporal_pool_type='avg', spatial_pool_type='max')[source]¶

Long-Term Feature Bank Infer Head.

This head is used to derive and save the LFB without affecting the input.

Parameters

lfb_prefix_path (str) – The prefix path to store the lfb.
dataset_mode (str, optional) – Which dataset to be inferred. Choices are ‘train’, ‘val’ or ‘test’. Default: ‘train’.
use_half_precision (bool, optional) – Whether to store the half-precision roi features. Default: True.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.

forward(x, rois, img_metas)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmaction.models.MobileNetV2(pretrained=None, widen_factor=1.0, out_indices=(7, ), frozen_stages=-1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU6'}, norm_eval=False, with_cp=False)[source]¶

MobileNetV2 backbone.

Parameters

pretrained (str | None) – Name of pretrained model. Default: None.
widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Default: 1.0.
out_indices (None or Sequence[int]) – Output from which stages. Default: (7, ).
frozen_stages (int) – Stages to be frozen (all param fixed). Default: -1, which means not freezing any parameters.
conv_cfg (dict) – Config dict for convolution layer. Default: dict(type=’Conv’).
norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type=’ReLU6’, inplace=True).
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

forward(x)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

make_layer(out_channels, num_blocks, stride, expand_ratio)[source]¶

Stack InvertedResidual blocks to build a layer for MobileNetV2.

Parameters

out_channels (int) – out_channels of block.
num_blocks (int) – number of blocks.
stride (int) – stride of the first block. Default: 1
expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio. Default: 6.

train(mode=True)[source]¶

Sets the module in training mode.

This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Parameters: mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.
Returns: self
Return type: Module

class mmaction.models.MobileNetV2TSM(num_segments=8, is_shift=True, shift_div=8, **kwargs)[source]¶

MobileNetV2 backbone for TSM.

Parameters

num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to make temporal shift in reset layers. Default: True.
shift_div (int) – Number of div for shift. Default: 8.
**kwargs (keyword arguments, optional) – Arguments for MobileNetV2.
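
The role of shift_div can be illustrated with a plain-Python temporal shift in the style of the TSM paper: one 1/shift_div fraction of the channels is taken from the next frame, a second fraction from the previous frame, and the remaining channels stay in place. The list-based layout (frames by channels) and the zero padding at the sequence ends are assumptions made for this sketch, not the backbone's implementation.

```python
def temporal_shift(frames, shift_div=8):
    """Shift a fraction of channels along time; frames is a list of T channel lists."""
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // shift_div
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # first fold: read from the next frame
            elif c < 2 * fold:
                src = t - 1  # second fold: read from the previous frame
            else:
                src = t      # remaining channels are left unshifted
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Encode each value as 10 * frame_index + channel_index so shifts are easy to read.
frames = [[float(10 * t + c) for c in range(8)] for t in range(3)]
shifted = temporal_shift(frames, shift_div=8)
print(shifted[1])
```

A larger shift_div moves fewer channels, so more of each frame's own features are preserved.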

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_shift()[source]¶: Make temporal shift for some layers.

class mmaction.models.NLLLoss(loss_weight=1.0)[source]¶

NLL Loss.

It will calculate NLL loss given cls_score and label.

class mmaction.models.OHEMHingeLoss[source]¶

This class is the core implementation for the completeness loss in the paper.

It computes class-wise hinge loss and performs online hard example mining (OHEM).

static backward(ctx, grad_output)[source]¶

Defines a formula for differentiating the operation.

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs did forward() return, and it should return as many tensors, as there were inputs to forward(). Each argument is the gradient w.r.t the given output, and each returned value should be the gradient w.r.t. the corresponding input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx, pred, labels, is_positive, ohem_ratio, group_size)[source]¶

Calculate OHEM hinge loss.

Parameters

pred (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
is_positive (int) – Set to 1 when proposals are positive and set to -1 when proposals are incomplete.
ohem_ratio (float) – Ratio of hard examples.
group_size (int) – Number of proposals sampled per video.

Returns

Returned class-wise hinge loss.

Return type

torch.Tensor
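
The combination of hinge loss and hard example mining can be sketched in plain Python: compute a hinge loss per proposal from the score of its labelled class, split the proposals into groups of group_size, and keep only the largest losses in each group. Using 0-based labels, summing the kept losses and the helper name itself are assumptions for this example, not a description of the class's exact behaviour.

```python
def ohem_hinge_loss(pred, labels, is_positive, ohem_ratio, group_size):
    """Sum of the hardest hinge losses per group (illustrative sketch)."""
    # Hinge loss on the score of each proposal's labelled class.
    losses = [max(0.0, 1.0 - is_positive * row[label])
              for row, label in zip(pred, labels)]
    keep = int(group_size * ohem_ratio)
    total = 0.0
    for start in range(0, len(losses), group_size):
        # Sort each group by loss and keep only the hardest examples.
        group = sorted(losses[start:start + group_size], reverse=True)
        total += sum(group[:keep])
    return total


pred = [[0.2], [0.9], [-0.5], [1.5]]
labels = [0, 0, 0, 0]
loss = ohem_hinge_loss(pred, labels, is_positive=1, ohem_ratio=0.5, group_size=2)
print(loss)
```

With ohem_ratio=0.5 and group_size=2, only the single hardest proposal in each pair contributes to the loss.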

class mmaction.models.PEM(pem_feat_dim, pem_hidden_dim, pem_u_ratio_m, pem_u_ratio_l, pem_high_temporal_iou_threshold, pem_low_temporal_iou_threshold, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, fc1_ratio=0.1, fc2_ratio=0.1, output_dim=1)[source]¶

Proposals Evaluation Model for Boundary Sensitive Network.

Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.

Code reference https://github.com/wzmsltw/BSN-boundary-sensitive-network

Parameters

pem_feat_dim (int) – Feature dimension.
pem_hidden_dim (int) – Hidden layer dimension.
pem_u_ratio_m (float) – Ratio for medium score proposals to balance data.
pem_u_ratio_l (float) – Ratio for low score proposals to balance data.
pem_high_temporal_iou_threshold (float) – High IoU threshold.
pem_low_temporal_iou_threshold (float) – Low IoU threshold.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
fc1_ratio (float) – Ratio for fc1 layer output. Default: 0.1.
fc2_ratio (float) – Ratio for fc2 layer output. Default: 0.1.
output_dim (int) – Output dimension. Default: 1.

forward(bsp_feature, reference_temporal_iou=None, tmin=None, tmax=None, tmin_score=None, tmax_score=None, video_meta=None, return_loss=True)[source]¶: Define the computation performed at every call.

forward_test(bsp_feature, tmin, tmax, tmin_score, tmax_score, video_meta)[source]¶: Define the computation performed at every call when testing.

forward_train(bsp_feature, reference_temporal_iou)[source]¶: Define the computation performed at every call when training.

class mmaction.models.ResNet(depth, pretrained=None, torchvision_pretrain=True, in_channels=3, num_stages=4, out_indices=(3, ), strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), style='pytorch', frozen_stages=-1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, partial_bn=False, with_cp=False)[source]¶

ResNet backbone.

Parameters

depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of the first block of each stage.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
dilations (Sequence[int]) – Dilation of each stage.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config for activation layers. Default: dict(type=’ReLU’, inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
partial_bn (bool) – Whether to use partial bn. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

train(mode=True)[source]¶: Set the optimization status when training.

class mmaction.models.ResNet2Plus1d(*args, **kwargs)[source]¶

ResNet (2+1)d backbone.

This model is proposed in A Closer Look at Spatiotemporal Convolutions for Action Recognition

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

class mmaction.models.ResNet3d(depth, pretrained, pretrained2d=True, in_channels=3, num_stages=4, base_channels=64, out_indices=(3, ), spatial_strides=(1, 2, 2, 2), temporal_strides=(1, 1, 1, 1), dilations=(1, 1, 1, 1), conv1_kernel=(5, 7, 7), conv1_stride_t=2, pool1_stride_t=2, with_pool2=True, style='pytorch', frozen_stages=-1, inflate=(1, 1, 1, 1), inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, non_local=(0, 0, 0, 0), non_local_cfg={}, zero_init_residual=True, **kwargs)[source]¶

ResNet 3d backbone.

Parameters

depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
in_channels (int) – Channel num of input features. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (1, 2, 2, 2).
temporal_strides (Sequence[int]) – Temporal strides of residual blocks of each stage. Default: (1, 1, 1, 1).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (5, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 2.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 2.
with_pool2 (bool) – Whether to use pool2. Default: True.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
inflate (Sequence[int]) – Inflate Dims of each block. Default: (1, 1, 1, 1).
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
conv_cfg (dict) – Config for conv layers. The required key is type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
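
As a hedged illustration of how these parameters might be written as a config dict, the sketch below uses the dict(type=...) convention seen elsewhere in this reference together with the documented defaults. The 'ResNet3d' type string is an assumption about the registry name, and check_per_stage_lengths is a hypothetical sanity check, not part of the library.

```python
backbone = dict(
    type="ResNet3d",       # assumed registry name
    depth=50,
    pretrained=None,
    pretrained2d=False,
    num_stages=4,
    out_indices=(3, ),
    spatial_strides=(1, 2, 2, 2),
    temporal_strides=(1, 1, 1, 1),
    dilations=(1, 1, 1, 1),
    inflate=(1, 1, 1, 1),
    non_local=(0, 0, 0, 0),
)


def check_per_stage_lengths(cfg):
    """Return True if every per-stage sequence has num_stages entries (hypothetical check)."""
    per_stage_keys = ("spatial_strides", "temporal_strides", "dilations",
                      "inflate", "non_local")
    return all(len(cfg[key]) == cfg["num_stages"] for key in per_stage_keys)


print(check_per_stage_lengths(backbone))
```

Keeping the per-stage tuples the same length as num_stages avoids configuring a stage that does not exist.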

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

static make_res_layer(block, inplanes, planes, blocks, spatial_stride=1, temporal_stride=1, dilation=1, style='pytorch', inflate=1, inflate_style='3x1x1', non_local=0, non_local_cfg={}, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶

Build residual layer for ResNet3D.

Parameters

block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
spatial_stride (int | Sequence[int]) – Spatial strides in residual and conv layers. Default: 1.
temporal_stride (int | Sequence[int]) – Temporal strides in residual and conv layers. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
style (str) – pytorch or caffe. If set to pytorch, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.
inflate (int | Sequence[int]) – Determine whether to inflate for each block. Default: 1.
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
non_local (int | Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: 0.
non_local_cfg (dict) – Config for non-local module. Default: dict().
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activation layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

Return type

nn.Module

train(mode=True)[source]¶: Set the optimization status when training.

class mmaction.models.ResNet3dCSN(depth, pretrained, temporal_strides=(1, 2, 2, 2), conv1_kernel=(3, 7, 7), conv1_stride_t=1, pool1_stride_t=1, norm_cfg={'eps': 0.001, 'requires_grad': True, 'type': 'BN3d'}, inflate_style='3x3x3', bottleneck_mode='ir', bn_frozen=False, **kwargs)[source]¶

ResNet backbone for CSN.

Parameters

depth (int) – Depth of ResNetCSN, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
temporal_strides (tuple[int]) – Temporal strides of residual blocks of each stage. Default: (1, 2, 2, 2).
conv1_kernel (tuple[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN3d’, requires_grad=True, eps=1e-3).
inflate_style (str) – 3x1x1 or 1x1x1. which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x3x3’.
bottleneck_mode (str) –
Determine which ways to factorize a 3D bottleneck block using channel-separated convolutional networks.

If set to ‘ip’, it will replace the 3x3x3 conv2 layer with a 1x1x1 traditional convolution and a 3x3x3 depthwise convolution, i.e., Interaction-preserved channel-separated bottleneck block. If set to ‘ir’, it will replace the 3x3x3 conv2 layer with a 3x3x3 depthwise convolution, which is derived from preserved bottleneck block by removing the extra 1x1x1 convolution, i.e., Interaction-reduced channel-separated bottleneck block.

Default: ‘ir’.
kwargs (dict, optional) – Key arguments for “make_res_layer”.

train(mode=True)[source]¶: Set the optimization status when training.

class mmaction.models.ResNet3dLayer(depth, pretrained, pretrained2d=True, stage=3, base_channels=64, spatial_stride=2, temporal_stride=1, dilation=1, style='pytorch', all_frozen=False, inflate=1, inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶

ResNet 3d Layer.

Parameters

depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
stage (int) – The index of Resnet stage. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
spatial_stride (int) – The 1st res block’s spatial stride. Default: 2.
temporal_stride (int) – The 1st res block’s temporal stride. Default: 1.
dilation (int) – The dilation. Default: 1.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
all_frozen (bool) – Whether to freeze all modules in the layer. Default: False.
inflate (int) – Inflate Dims of each block. Default: 1.
inflate_style (str) – 3x1x1 or 3x3x3, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

train(mode=True)[source]¶: Set the optimization status when training.

class mmaction.models.ResNet3dSlowFast(pretrained, resample_rate=8, speed_ratio=8, channel_ratio=8, slow_pathway={'conv1_kernel': (1, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'dilations': (1, 1, 1, 1), 'inflate': (0, 0, 1, 1), 'lateral': True, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, fast_pathway={'base_channels': 8, 'conv1_kernel': (5, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'lateral': False, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'})[source]¶

Slowfast backbone.

This module is proposed in SlowFast Networks for Video Recognition

Parameters

pretrained (str) – The file path to a pretrained model.
resample_rate (int) – A large temporal stride resample_rate on input frames. The actual resample rate is calculated by multiplying the interval in SampleFrames in the pipeline by resample_rate, equivalent to the \(\tau\) in the paper, i.e. it processes only one out of resample_rate * interval frames. Default: 8.
speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Default: 8.
channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Default: 8.
slow_pathway (dict) –
Configuration of the slow branch. It should contain the arguments needed to build the specific type of pathway, plus: type (str), the type of backbone the pathway is based on; and lateral (bool), which determines whether to build lateral connections for the pathway. Default:
```
dict(type='ResNetPathway',
lateral=True, depth=50, pretrained=None,
conv1_kernel=(1, 7, 7), dilations=(1, 1, 1, 1),
conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1))
```

fast_pathway (dict) –

Configuration of fast branch, similar to slow_pathway. Default:

dict(type='ResNetPathway',
lateral=False, depth=50, pretrained=None, base_channels=8,
conv1_kernel=(5, 7, 7), conv1_stride_t=1, pool1_stride_t=1)
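To make the relationship between resample_rate, speed_ratio and channel_ratio concrete, here is a small arithmetic sketch. It assumes the slow pathway keeps one frame out of every resample_rate, that the fast pathway keeps one out of every resample_rate // speed_ratio, and that the slow pathway uses 64 base channels. The function name and the exact subsampling rule are illustrative assumptions, not the backbone's actual code.

```python
def slowfast_pathway_shapes(num_frames, resample_rate=8, speed_ratio=8,
                            channel_ratio=8, slow_base_channels=64):
    """Estimate temporal length and base width of the two pathways."""
    if resample_rate % speed_ratio != 0:
        raise ValueError("resample_rate should be divisible by speed_ratio")
    # Slow pathway: large temporal stride on the input frames.
    slow_frames = num_frames // resample_rate
    # Fast pathway: speed_ratio times denser in time than the slow pathway.
    fast_stride = resample_rate // speed_ratio
    fast_frames = num_frames // fast_stride
    # Fast pathway is made lightweight by shrinking its channel count.
    fast_base_channels = slow_base_channels // channel_ratio
    return {
        "slow_frames": slow_frames,
        "fast_frames": fast_frames,
        "fast_base_channels": fast_base_channels,
    }


shapes = slowfast_pathway_shapes(num_frames=32)
print(shapes)
```

With the defaults and a 32-frame clip, the fast pathway sees 8 times as many frames as the slow pathway and its base width of 8 matches the base_channels=8 entry in the default fast_pathway config.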

forward(x)[source]¶

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

tuple[torch.Tensor]

init_weights(pretrained=None)[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.ResNet3dSlowOnly(*args, lateral=False, conv1_kernel=(1, 7, 7), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1), with_pool2=False, **kwargs)[source]¶

SlowOnly backbone based on ResNet3dPathway.

Parameters

*args (arguments) – Arguments same as ResNet3dPathway.
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (1, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
inflate (Sequence[int]) – Inflate Dims of each block. Default: (0, 0, 1, 1).
**kwargs (keyword arguments) – Keywords arguments for ResNet3dPathway.

class mmaction.models.ResNetAudio(depth, pretrained, in_channels=1, num_stages=4, base_channels=32, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), conv1_kernel=9, conv1_stride=1, frozen_stages=-1, factorize=(1, 1, 0, 0), norm_eval=False, with_cp=False, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, zero_init_residual=True)[source]¶

ResNet 2d audio backbone. Reference: https://arxiv.org/abs/2001.08740.

Parameters

depth (int) – Depth of resnet, from {50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
in_channels (int) – Channel num of input features. Default: 1.
base_channels (int) – Channel num of stem output features. Default: 32.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of residual blocks of each stage. Default: (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (int) – Kernel size of the first conv layer. Default: 9.
conv1_stride (int | tuple[int]) – Stride of the first conv layer. Default: 1.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
factorize (Sequence[int]) – factorize Dims of each block for audio. Default: (1, 1, 0, 0).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config for activation layers. Default: dict(type=’ReLU’, inplace=True).
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

make_res_layer(block, inplanes, planes, blocks, stride=1, dilation=1, factorize=1, norm_cfg=None, with_cp=False)[source]¶

Build residual layer for ResNetAudio.

Parameters

block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
stride (int) – Stride of the residual layer. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
factorize (int | Sequence[int]) – Determine whether to factorize for each block. Default: 1.
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: None.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

train(mode=True)[source]¶: Set the optimization status when training.

class mmaction.models.ResNetTIN(depth, num_segments=8, is_tin=True, shift_div=4, **kwargs)[source]¶

ResNet backbone for TIN.

Parameters

depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments. Default: 8.
is_tin (bool) – Whether to apply temporal interlace. Default: True.
shift_div (int) – Number of division parts for shift. Default: 4.
kwargs (dict, optional) – Arguments for ResNet.

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_interlace()[source]¶: Make temporal interlace for some layers.

class mmaction.models.ResNetTSM(depth, num_segments=8, is_shift=True, non_local=(0, 0, 0, 0), non_local_cfg={}, shift_div=8, shift_place='blockres', temporal_pool=False, **kwargs)[source]¶

ResNet backbone for TSM.

Parameters

depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to make temporal shift in resnet layers. Default: True.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
shift_div (int) – Number of div for shift. Default: 8.
shift_place (str) – Places in resnet layers for shift, which is chosen from [‘block’, ‘blockres’]. If set to ‘block’, it will apply temporal shift to all child blocks in each resnet layer. If set to ‘blockres’, it will apply temporal shift to each conv1 layer of all child blocks in each resnet layer. Default: ‘blockres’.
temporal_pool (bool) – Whether to add temporal pooling. Default: False.
**kwargs (keyword arguments, optional) – Arguments for ResNet.

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_pool()[source]¶: Make temporal pooling between layer1 and layer2, using a 3D max pooling layer.

make_temporal_shift()[source]¶: Make temporal shift for some layers.
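The role of shift_div is easiest to see on a tiny example. The sketch below illustrates the general temporal shift idea in plain Python: one 1/shift_div slice of the channels takes its values from the next segment, another slice of the same size takes them from the previous segment, and the rest stay in place, with zeros at the sequence boundaries. The function name, the list-based layout and the channel ordering are assumptions for illustration and do not reproduce the library's implementation.

```python
def temporal_shift(frames, shift_div=8):
    """Shift part of the channels along the segment (time) axis.

    frames: list of T per-segment feature vectors, each a list of C values.
    Returns a new T x C list; positions with no source segment are zero.
    """
    num_segments = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // shift_div  # channels moved in each direction
    shifted = [[0] * num_channels for _ in range(num_segments)]
    for t in range(num_segments):
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # first slice: read from the next segment
            elif c < 2 * fold:
                src = t - 1  # second slice: read from the previous segment
            else:
                src = t      # remaining channels are left untouched
            if 0 <= src < num_segments:
                shifted[t][c] = frames[src][c]
    return shifted


# Value 10 * t + c makes it easy to see where each entry came from.
frames = [[10 * t + c for c in range(8)] for t in range(3)]
shifted = temporal_shift(frames, shift_div=4)
for row in shifted:
    print(row)
```

A larger shift_div moves a smaller fraction of the channels, so more of each segment's own features are preserved.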

class mmaction.models.SSNLoss[source]¶

static activity_loss(activity_score, labels, activity_indexer)[source]¶

Activity Loss.

It will calculate activity loss given activity_score and label.

Parameters

activity_score (torch.Tensor) – Predicted activity score.
labels (torch.Tensor) – Groundtruth class label.
activity_indexer (torch.Tensor) – Index slices of proposals.

Returns: Returned cross entropy loss.
Return type: torch.Tensor

static classwise_regression_loss(bbox_pred, labels, bbox_targets, regression_indexer)[source]¶

Classwise Regression Loss.

It will calculate classwise_regression loss given class_reg_pred and targets.

Parameters

bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
regression_indexer (torch.Tensor) – Index slices of positive proposals.

Returns: Returned class-wise regression loss.
Return type: torch.Tensor

static completeness_loss(completeness_score, labels, completeness_indexer, positive_per_video, incomplete_per_video, ohem_ratio=0.17)[source]¶

Completeness Loss.

It will calculate completeness loss given completeness_score and label.

Parameters

completeness_score (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
completeness_indexer (torch.Tensor) – Index slices of positive and incomplete proposals.
positive_per_video (int) – Number of positive proposals sampled per video.
incomplete_per_video (int) – Number of incomplete proposals sampled per video.
ohem_ratio (float) – Ratio of online hard example mining. Default: 0.17.

Returns: Returned class-wise completeness loss.
Return type: torch.Tensor

forward(activity_score, completeness_score, bbox_pred, proposal_type, labels, bbox_targets, train_cfg)[source]¶

Calculate the Structured Segment Network (SSN) loss.

Parameters

activity_score (torch.Tensor) – Predicted activity score.
completeness_score (torch.Tensor) – Predicted completeness score.
bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
proposal_type (torch.Tensor) – Type index slices of proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
train_cfg (dict) – Config for training.

Returns

(loss_activity, loss_completeness, loss_reg). loss_activity is the activity loss, loss_completeness is the class-wise completeness loss, and loss_reg is the class-wise regression loss.

Return type

dict([torch.Tensor, torch.Tensor, torch.Tensor])

class mmaction.models.SingleRoIExtractor3D(roi_layer_type='RoIAlign', featmap_stride=16, output_size=16, sampling_ratio=0, pool_mode='avg', aligned=True, with_temporal_pool=True, with_global=False)[source]¶

Extract RoI features from a single level feature map.

Parameters

roi_layer_type (str) – Specify the RoI layer type. Default: ‘RoIAlign’.
featmap_stride (int) – Strides of input feature maps. Default: 16.
output_size (int | tuple) – Size or (Height, Width). Default: 16.
sampling_ratio (int) – Number of input samples to take for each output sample. 0 to take samples densely for current models. Default: 0.
pool_mode (str, 'avg' or 'max') – pooling mode in each bin. Default: ‘avg’.
aligned (bool) – If False, use the legacy implementation in MMDetection. If True, align the results more precisely. Default: True.
with_temporal_pool (bool) – if True, avgpool the temporal dim. Default: True.
with_global (bool) – if True, concatenate the RoI feature with global feature. Default: False.

Note that sampling_ratio, pool_mode, aligned only apply when roi_layer_type is set as RoIAlign.

forward(feat, rois)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmaction.models.SlowFastHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.8, init_std=0.01, **kwargs)[source]¶

The classification head for SlowFast.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The classification scores for input samples.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.TAM(in_channels, num_segments, alpha=2, adaptive_kernel_size=3, beta=4, conv1d_kernel_size=3, adaptive_convolution_stride=1, adaptive_convolution_padding=1, init_std=0.001)[source]¶

Temporal Adaptive Module(TAM) for TANet.

This module is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Parameters

in_channels (int) – Channel num of input features.
num_segments (int) – Number of frame segments.
alpha (int) – Corresponds to `alpha` in the paper: the ratio of the intermediate channel number to the initial channel number in the global branch. Default: 2.
adaptive_kernel_size (int) – Corresponds to `K` in the paper: the size of the adaptive kernel in the global branch. Default: 3.
beta (int) – Corresponds to `beta` in the paper: controls the model complexity in the local branch. Default: 4.
conv1d_kernel_size (int) – Size of the convolution kernel of Conv1d in the local branch. Default: 3.
adaptive_convolution_stride (int) – The first dimension of strides in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
adaptive_convolution_padding (int) – The first dimension of paddings in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
init_std (float) – Std value for initiation of nn.Linear. Default: 0.001.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The output of the module.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.TANet(depth, num_segments, tam_cfg={}, **kwargs)[source]¶

Temporal Adaptive Network (TANet) backbone.

This backbone is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Embedding the temporal adaptive module (TAM) into ResNet to instantiate TANet.

Parameters

depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments.
tam_cfg (dict | None) – Config for temporal adaptive module (TAM). Default: dict().
**kwargs (keyword arguments, optional) – Arguments for ResNet except `depth`.

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

make_tam_modeling()[source]¶: Replace ResNet-Block with TA-Block.

class mmaction.models.TEM(temporal_dim, boundary_ratio, tem_feat_dim, tem_hidden_dim, tem_match_threshold, loss_cls={'type': 'BinaryLogisticRegressionLoss'}, loss_weight=2, output_dim=3, conv1_ratio=1, conv2_ratio=1, conv3_ratio=0.01)[source]¶

Temporal Evaluation Model for Boundary Sensitive Network.

Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.

Code reference https://github.com/wzmsltw/BSN-boundary-sensitive-network

Parameters

tem_feat_dim (int) – Feature dimension.
tem_hidden_dim (int) – Hidden layer dimension.
tem_match_threshold (float) – Temporal evaluation match threshold.
loss_cls (dict) – Config for building loss. Default: dict(type='BinaryLogisticRegressionLoss').
loss_weight (float) – Weight term for action_loss. Default: 2.
output_dim (int) – Output dimension. Default: 3.
conv1_ratio (float) – Ratio of conv1 layer output. Default: 1.0.
conv2_ratio (float) – Ratio of conv2 layer output. Default: 1.0.
conv3_ratio (float) – Ratio of conv3 layer output. Default: 0.01.

forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶: Define the computation performed at every call.

forward_test(raw_feature, video_meta)[source]¶: Define the computation performed at every call when testing.

forward_train(raw_feature, label_action, label_start, label_end)[source]¶: Define the computation performed at every call when training.

generate_labels(gt_bbox)[source]¶: Generate training labels.

class mmaction.models.TPN(in_channels, out_channels, spatial_modulation_cfg=None, temporal_modulation_cfg=None, upsample_cfg=None, downsample_cfg=None, level_fusion_cfg=None, aux_head_cfg=None, flow_type='cascade')[source]¶

TPN neck.

This module is proposed in Temporal Pyramid Network for Action Recognition

Parameters

in_channels (tuple[int]) – Channel numbers of input features tuple.
out_channels (int) – Channel number of output feature.
spatial_modulation_cfg (dict | None) – Config for spatial modulation layers. Required keys are in_channels and out_channels. Default: None.
temporal_modulation_cfg (dict | None) – Config for temporal modulation layers. Default: None.
upsample_cfg (dict | None) – Config for upsample layers. The keys are the same as those in nn.Upsample. Default: None.
downsample_cfg (dict | None) – Config for downsample layers. Default: None.
level_fusion_cfg (dict | None) – Config for level fusion layers. Required keys are ‘in_channels’, ‘mid_channels’, ‘out_channels’. Default: None.
aux_head_cfg (dict | None) – Config for aux head layers. Required keys are ‘out_channels’. Default: None.
flow_type (str) – Flow type to combine the features. Options are ‘cascade’ and ‘parallel’. Default: ‘cascade’.
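Because several of these sub-configs have required keys, it can help to check a neck config before building the model. The following is a minimal sketch under stated assumptions: the helper missing_tpn_keys is hypothetical, the channel numbers are placeholders, and the type="TPN" entry assumes the neck is registered under its class name. Only the required keys listed above are checked.

```python
# Required keys as listed in the parameter descriptions above.
REQUIRED_KEYS = {
    "spatial_modulation_cfg": {"in_channels", "out_channels"},
    "level_fusion_cfg": {"in_channels", "mid_channels", "out_channels"},
    "aux_head_cfg": {"out_channels"},
}


def missing_tpn_keys(neck_cfg):
    """Return {sub_config_name: [missing keys]} for a TPN-style neck config."""
    missing = {}
    for name, required in REQUIRED_KEYS.items():
        sub_cfg = neck_cfg.get(name)
        if sub_cfg is None:
            # None is the documented default, so there is nothing to check.
            continue
        absent = sorted(required - set(sub_cfg))
        if absent:
            missing[name] = absent
    return missing


# Placeholder values; real channel numbers depend on the backbone in use.
neck = dict(
    type="TPN",
    in_channels=(1024, 2048),
    out_channels=1024,
    spatial_modulation_cfg=dict(in_channels=(1024, 2048), out_channels=2048),
    level_fusion_cfg=dict(
        in_channels=(1024, 1024), mid_channels=(1024, 1024), out_channels=2048),
    aux_head_cfg=dict(out_channels=400),
    flow_type="cascade",
)

print(missing_tpn_keys(neck))
```

An incomplete sub-config, for example an empty aux_head_cfg, is reported with the names of the keys it lacks.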

forward(x, target=None)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmaction.models.TPNHead(*args, **kwargs)[source]¶

Class head for TPN.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for Initiation. Default: 0.01.
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: https://arxiv.org/abs/1906.02629. Default: 0.

forward(x, num_segs=None, fcn_test=False)[source]¶

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.
num_segs (int | None) – Number of segments into which a video is divided. Default: None.
fcn_test (bool) – Whether to apply full convolution (fcn) testing. Default: False.

Returns

The classification scores for input samples.

Return type

torch.Tensor

class mmaction.models.TRNHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', relation_type='TRNMultiScale', hidden_dim=256, dropout_ratio=0.8, init_std=0.001, **kwargs)[source]¶

Class head for TRN.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
relation_type (str) – The relation module type. Choices are ‘TRN’ or ‘TRNMultiScale’. Default: ‘TRNMultiScale’.
hidden_dim (int) – The dimension of hidden layer of MLP in relation module. Default: 256.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for Initiation. Default: 0.001.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs)[source]¶

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.
num_segs (int) – Unused in TRNHead. By default, num_segs equals clip_len * num_clips * num_crops, which is generated automatically in the Recognizer forward phase and is not needed by TRN models. The self.num_segments required here is a hyperparameter used to build TRN models.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.TSMHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.8, init_std=0.001, is_shift=True, temporal_pool=False, **kwargs)[source]¶

Class head for TSM.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for Initiation. Default: 0.001.
is_shift (bool) – Indicating whether the feature is shifted. Default: True.
temporal_pool (bool) – Indicating whether feature is temporal pooled. Default: False.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs)[source]¶

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.
num_segs (int) – Unused in TSMHead. By default, num_segs equals clip_len * num_clips * num_crops, which is generated automatically in the Recognizer forward phase and is not needed by TSM models. The self.num_segments required here is a hyperparameter used to build TSM models.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.TSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶

Class head for TSN.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs)[source]¶

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.
num_segs (int) – Number of segments into which a video is divided.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.
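The default consensus config, dict(type='AvgConsensus', dim=1), suggests that per-segment scores are averaged over the segment dimension. The sketch below shows that reduction on plain lists, assuming the scores arrive as N * num_segs rows that are grouped into N videos of num_segs segments each. The function name and the list layout are illustrative assumptions, not the head's actual code.

```python
def average_consensus(scores, num_segs):
    """Average per-segment class scores into one score row per video.

    scores: list of N * num_segs rows, each holding num_classes floats.
    Returns a list of N rows, each holding num_classes floats.
    """
    if len(scores) % num_segs != 0:
        raise ValueError("number of rows must be a multiple of num_segs")
    num_classes = len(scores[0])
    video_scores = []
    for start in range(0, len(scores), num_segs):
        # Rows start .. start + num_segs - 1 belong to the same video.
        group = scores[start:start + num_segs]
        video_scores.append(
            [sum(row[k] for row in group) / num_segs for k in range(num_classes)])
    return video_scores


# Two videos, two segments each, two classes.
segment_scores = [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0], [4.0, 2.0]]
video_scores = average_consensus(segment_scores, num_segs=2)
print(video_scores)
```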

class mmaction.models.X3D(gamma_w=1.0, gamma_b=1.0, gamma_d=1.0, pretrained=None, in_channels=3, num_stages=4, spatial_strides=(2, 2, 2, 2), frozen_stages=-1, se_style='half', se_ratio=0.0625, use_swish=True, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶

X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.

Parameters

gamma_w (float) – Global channel width expansion factor. Default: 1.
gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.
gamma_d (float) – Network depth expansion factor. Default: 1.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).
frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

make_res_layer(block, layer_inplanes, inplanes, planes, blocks, spatial_stride=1, se_style='half', se_ratio=None, use_swish=True, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶

Build residual layer for ResNet3D.

Parameters

block (nn.Module) – Residual module to be built.
layer_inplanes (int) – Number of channels for the input feature of the res layer.
inplanes (int) – Number of channels for the input feature in each block, which equals base_channels * gamma_w.
planes (int) – Number of channels for the output feature in each block, which equals base_channels * gamma_w * gamma_b.
blocks (int) – Number of residual blocks.
spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activation layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

Return type

nn.Module
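The relationship between inplanes, planes and the expansion factors can be checked with a few lines of arithmetic. This is a sketch only: base_channels=24 is an illustrative value that is not taken from this page, the helper name is hypothetical, and the real backbone may additionally round widths to convenient multiples.

```python
def x3d_block_channels(base_channels, gamma_w=1.0, gamma_b=1.0):
    """Apply the width formulas from the parameter list above.

    inplanes = base_channels * gamma_w
    planes   = base_channels * gamma_w * gamma_b
    """
    inplanes = int(base_channels * gamma_w)
    planes = int(base_channels * gamma_w * gamma_b)
    return inplanes, planes


# Illustrative values: widen the bottleneck by 2.25x, then the whole net by 2x.
narrow = x3d_block_channels(24, gamma_w=1.0, gamma_b=2.25)
wide = x3d_block_channels(24, gamma_w=2.0, gamma_b=2.25)
print(narrow, wide)
```

gamma_w scales both numbers, while gamma_b only widens the inner (bottleneck) channels of each block.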

train(mode=True)[source]¶: Set the optimization status when training.

class mmaction.models.X3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, fc1_bias=False)[source]¶

Classification head for X3D.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for Initiation. Default: 0.01.
fc1_bias (bool) – If the first fc layer has bias. Default: False.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The classification scores for input samples.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

mmaction.models.build_backbone(cfg)[source]¶: Build backbone.

mmaction.models.build_head(cfg)[source]¶: Build head.

mmaction.models.build_localizer(cfg)[source]¶: Build localizer.

mmaction.models.build_loss(cfg)[source]¶: Build loss.

mmaction.models.build_model(cfg, train_cfg=None, test_cfg=None)[source]¶: Build model.

mmaction.models.build_neck(cfg)[source]¶: Build neck.

mmaction.models.build_recognizer(cfg, train_cfg=None, test_cfg=None)[source]¶: Build recognizer.
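All of these builders take a config whose type entry names the component to construct, as in loss_cls=dict(type='CrossEntropyLoss') above. The toy sketch below illustrates that general pattern with a hand-written registry. REGISTRY, register, build and ToyHead are hypothetical names made up for this example and are not the library's registry API.

```python
REGISTRY = {}


def register(cls):
    """Record a class under its own name so configs can refer to it."""
    REGISTRY[cls.__name__] = cls
    return cls


@register
class ToyHead:
    """Stand-in for a classification head; stores its arguments only."""

    def __init__(self, num_classes, in_channels, dropout_ratio=0.5):
        self.num_classes = num_classes
        self.in_channels = in_channels
        self.dropout_ratio = dropout_ratio


def build(cfg):
    """Instantiate the class named by cfg['type'] with the remaining keys."""
    args = dict(cfg)  # copy so the caller's config is left untouched
    obj_type = args.pop("type")
    if obj_type not in REGISTRY:
        raise KeyError(f"{obj_type!r} is not registered")
    return REGISTRY[obj_type](**args)


head_cfg = dict(type="ToyHead", num_classes=400, in_channels=2048)
head = build(head_cfg)
print(type(head).__name__, head.num_classes, head.dropout_ratio)
```

Keeping the class name inside the config means a model can be swapped by editing a config file rather than the code that builds it.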

recognizers¶

class mmaction.models.recognizers.AudioRecognizer(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶

Audio recognizer model framework.

forward(audios, label=None, return_loss=True)[source]¶: Define the computation performed at every call.

forward_gradcam(audios)[source]¶: Defines the computation performed at every call when using gradcam utils.

forward_test(audios)[source]¶: Defines the computation performed at every call when evaluation and testing.

forward_train(audios, labels)[source]¶: Defines the computation performed at every call when training.

train_step(data_batch, optimizer, **kwargs)[source]¶

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters

data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars, and num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict
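The return contract above can be sketched in plain Python. This is only an illustration: floats stand in for tensors, and make_outputs and weighted_log_average are hypothetical helpers, not part of mmaction.

```python
def make_outputs(losses, batch_size):
    """Assemble a train_step-style result from named loss terms (floats here)."""
    total = sum(losses.values())           # stands in for the summed loss tensor
    log_vars = dict(losses, loss=total)    # everything that would be sent to the logger
    return dict(loss=total, log_vars=log_vars, num_samples=batch_size)


def weighted_log_average(outputs_list, key):
    """Average one logged value over iterations, weighted by num_samples."""
    total_samples = sum(o['num_samples'] for o in outputs_list)
    weighted = sum(o['log_vars'][key] * o['num_samples'] for o in outputs_list)
    return weighted / total_samples


step_a = make_outputs({'loss_cls': 1.0}, batch_size=8)
step_b = make_outputs({'loss_cls': 2.0}, batch_size=4)
avg_loss = weighted_log_average([step_a, step_b], 'loss')
```

Weighting by num_samples keeps a smaller final batch from counting as much as a full one when logs are averaged.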

val_step(data_batch, optimizer, **kwargs)[source]¶

The iteration step during validation.

This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.

class mmaction.models.recognizers.BaseRecognizer(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶

Base class for recognizers.

All recognizers should subclass it. All subclasses should override:

forward_train, which supports the forward pass during training.
forward_test, which supports the forward pass during testing.

Parameters

backbone (dict) – Backbone modules to extract feature.
cls_head (dict) – Classification head to process feature.
train_cfg (dict | None) – Config for training. Default: None.
test_cfg (dict | None) – Config for testing. Default: None.

average_clip(cls_score, num_segs=1)[source]¶

Averaging class score over multiple clips.

Uses the averaging type (‘score’, ‘prob’, or None, as defined in test_cfg) to compute the final averaged class score. Only called in test mode.

Parameters

cls_score (torch.Tensor) – Class score to be averaged.
num_segs (int) – Number of clips for each input sample.

Returns

Averaged class score.

Return type

torch.Tensor
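A minimal pure-Python sketch of the two averaging types, for a single sample. It assumes that ‘score’ averages raw class scores, that ‘prob’ applies softmax per clip before averaging, and that None returns the scores unchanged. The argument name average_clips and the list-based inputs are illustrative assumptions; the real method operates on torch tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one clip's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_clip(clip_scores, average_clips=None):
    """clip_scores: list of per-clip class-score lists for ONE input sample."""
    if average_clips is None:
        return clip_scores                      # no averaging requested
    if average_clips == 'prob':
        clip_scores = [softmax(s) for s in clip_scores]
    elif average_clips != 'score':
        raise ValueError(f'unsupported averaging type: {average_clips!r}')
    num_clips = len(clip_scores)
    # Average each class column across the clips.
    return [sum(col) / num_clips for col in zip(*clip_scores)]


scores = [[1.0, 3.0], [3.0, 1.0]]               # 2 clips, 2 classes
by_score = average_clip(scores, 'score')
by_prob = average_clip(scores, 'prob')
```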

extract_feat(imgs)[source]¶

Extract features through a backbone.

Parameters: imgs (torch.Tensor) – The input images.
Returns: The extracted features.
Return type: torch.Tensor

forward(imgs, label=None, return_loss=True, **kwargs)[source]¶: Define the computation performed at every call.

abstract forward_gradcam(imgs)[source]¶: Defines the computation performed at every call when using gradcam utils.

abstract forward_test(imgs)[source]¶: Defines the computation performed at every call when evaluation and testing.

abstract forward_train(imgs, labels, **kwargs)[source]¶: Defines the computation performed at every call when training.

init_weights()[source]¶: Initialize the model network weights.

train_step(data_batch, optimizer, **kwargs)[source]¶

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters

data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars, and num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict

val_step(data_batch, optimizer, **kwargs)[source]¶

The iteration step during validation.

This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.

property with_neck¶

Whether the recognizer has a neck.

Type: bool

class mmaction.models.recognizers.Recognizer2D(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶

2D recognizer model framework.

forward_dummy(imgs, softmax=False)[source]¶

Used for computing network FLOPs.

See tools/analysis/get_flops.py.

Parameters: imgs (torch.Tensor) – Input images.
Returns: Class score.
Return type: Tensor

forward_gradcam(imgs)[source]¶: Defines the computation performed at every call when using gradcam utils.

forward_test(imgs)[source]¶: Defines the computation performed at every call when evaluation and testing.

forward_train(imgs, labels, **kwargs)[source]¶: Defines the computation performed at every call when training.

class mmaction.models.recognizers.Recognizer3D(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶

3D recognizer model framework.

forward_dummy(imgs, softmax=False)[source]¶

Used for computing network FLOPs.

See tools/analysis/get_flops.py.

Parameters: imgs (torch.Tensor) – Input images.
Returns: Class score.
Return type: Tensor

forward_gradcam(imgs)[source]¶: Defines the computation performed at every call when using gradcam utils.

forward_test(imgs)[source]¶: Defines the computation performed at every call when evaluation and testing.

forward_train(imgs, labels, **kwargs)[source]¶: Defines the computation performed at every call when training.

localizers¶

class mmaction.models.localizers.BMN(temporal_dim, boundary_ratio, num_samples, num_samples_per_bin, feat_dim, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, loss_cls={'type': 'BMNLoss'}, hidden_dim_1d=256, hidden_dim_2d=128, hidden_dim_3d=512)[source]¶

Boundary Matching Network for temporal action proposal generation.

Please refer to BMN: Boundary-Matching Network for Temporal Action Proposal Generation. Code reference: https://github.com/JJBOY/BMN-Boundary-Matching-Network

Parameters

temporal_dim (int) – Total frames selected for each video.
boundary_ratio (float) – Ratio for determining video boundaries.
num_samples (int) – Number of samples for each proposal.
num_samples_per_bin (int) – Number of bin samples for each sample.
feat_dim (int) – Feature dimension.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
loss_cls (dict) – Config for building loss. Default: dict(type='BMNLoss').
hidden_dim_1d (int) – Hidden dim for 1d conv. Default: 256.
hidden_dim_2d (int) – Hidden dim for 2d conv. Default: 128.
hidden_dim_3d (int) – Hidden dim for 3d conv. Default: 512.
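To give a feel for how soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold and post_process_top_k might interact, here is a hedged sketch of a Gaussian soft-NMS over temporal proposals. The width-interpolated threshold and the assumption that proposals are normalized to [0, 1] are modeled loosely on BSN-style post-processing and are not taken from this page. temporal_iou and soft_nms are illustrative helpers, not the library's functions.

```python
import math


def temporal_iou(a, b):
    """IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(proposals, alpha, low_threshold, high_threshold, top_k):
    """proposals: list of (start, end, score). Returns up to top_k kept proposals."""
    remaining = [list(p) for p in proposals]
    kept = []
    while remaining and len(kept) < top_k:
        # Pick the highest-scoring remaining proposal.
        best = max(range(len(remaining)), key=lambda i: remaining[i][2])
        start, end, score = remaining.pop(best)
        kept.append((start, end, score))
        # Assumption: wider proposals use a threshold closer to high_threshold.
        width = end - start
        threshold = low_threshold + (high_threshold - low_threshold) * width
        for other in remaining:
            iou = temporal_iou((start, end), (other[0], other[1]))
            if iou > threshold:
                # Decay, rather than discard, heavily overlapping proposals.
                other[2] *= math.exp(-(iou ** 2) / alpha)
    return kept


proposals = [(0.0, 0.5, 0.9), (0.0, 0.4, 0.8), (0.6, 1.0, 0.7)]
kept = soft_nms(proposals, alpha=0.4, low_threshold=0.5,
                high_threshold=0.9, top_k=3)
```

In this toy input the second proposal overlaps the top one heavily, so its score is decayed and it drops below the disjoint third proposal.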

forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶: Define the computation performed at every call.

forward_test(raw_feature, video_meta)[source]¶: Define the computation performed at every call when testing.

forward_train(raw_feature, label_confidence, label_start, label_end)[source]¶: Define the computation performed at every call when training.

generate_labels(gt_bbox)[source]¶: Generate training labels.

class mmaction.models.localizers.BaseLocalizer(backbone, cls_head, train_cfg=None, test_cfg=None)[source]¶

Base class for localizers.

All localizers should subclass it. All subclasses should override forward_train, which supports the forward pass during training, and forward_test, which supports the forward pass during testing.

extract_feat(imgs)[source]¶

Extract features through a backbone.

Parameters: imgs (torch.Tensor) – The input images.
Returns: The extracted features.
Return type: torch.Tensor

forward(imgs, return_loss=True, **kwargs)[source]¶: Define the computation performed at every call.

abstract forward_test(imgs)[source]¶: Defines the computation performed at testing.

abstract forward_train(imgs, labels)[source]¶: Defines the computation performed at training.

init_weights()[source]¶: Weight initialization for model.

train_step(data_batch, optimizer, **kwargs)[source]¶

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.

Parameters

data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars, and num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.

Return type

dict

val_step(data_batch, optimizer, **kwargs)[source]¶

The iteration step during validation.

This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.

class mmaction.models.localizers.PEM(pem_feat_dim, pem_hidden_dim, pem_u_ratio_m, pem_u_ratio_l, pem_high_temporal_iou_threshold, pem_low_temporal_iou_threshold, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, fc1_ratio=0.1, fc2_ratio=0.1, output_dim=1)[source]¶

Proposals Evaluation Model for Boundary Sensitive Network.

Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.

Code reference https://github.com/wzmsltw/BSN-boundary-sensitive-network

Parameters

pem_feat_dim (int) – Feature dimension.
pem_hidden_dim (int) – Hidden layer dimension.
pem_u_ratio_m (float) – Ratio for medium score proposals to balance data.
pem_u_ratio_l (float) – Ratio for low score proposals to balance data.
pem_high_temporal_iou_threshold (float) – High IoU threshold.
pem_low_temporal_iou_threshold (float) – Low IoU threshold.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
fc1_ratio (float) – Ratio for fc1 layer output. Default: 0.1.
fc2_ratio (float) – Ratio for fc2 layer output. Default: 0.1.
output_dim (int) – Output dimension. Default: 1.

forward(bsp_feature, reference_temporal_iou=None, tmin=None, tmax=None, tmin_score=None, tmax_score=None, video_meta=None, return_loss=True)[source]¶: Define the computation performed at every call.

forward_test(bsp_feature, tmin, tmax, tmin_score, tmax_score, video_meta)[source]¶: Define the computation performed at every call when testing.

forward_train(bsp_feature, reference_temporal_iou)[source]¶: Define the computation performed at every call when training.

class mmaction.models.localizers.SSN(backbone, cls_head, in_channels=3, spatial_type='avg', dropout_ratio=0.5, loss_cls={'type': 'SSNLoss'}, train_cfg=None, test_cfg=None)[source]¶

Temporal Action Detection with Structured Segment Networks.

Parameters

backbone (dict) – Config for building backbone.
cls_head (dict) – Config for building classification head.
in_channels (int) – Number of channels for input data. Default: 3.
spatial_type (str) – Type of spatial pooling. Default: ‘avg’.
dropout_ratio (float) – Ratio of dropout. Default: 0.5.
loss_cls (dict) – Config for building loss. Default: dict(type='SSNLoss').
train_cfg (dict | None) – Config for training. Default: None.
test_cfg (dict | None) – Config for testing. Default: None.

forward_test(imgs, relative_proposal_list, scale_factor_list, proposal_tick_list, reg_norm_consts, **kwargs)[source]¶: Define the computation performed at every call when testing.

forward_train(imgs, proposal_scale_factor, proposal_type, proposal_labels, reg_targets, **kwargs)[source]¶: Define the computation performed at every call when training.

class mmaction.models.localizers.TEM(temporal_dim, boundary_ratio, tem_feat_dim, tem_hidden_dim, tem_match_threshold, loss_cls={'type': 'BinaryLogisticRegressionLoss'}, loss_weight=2, output_dim=3, conv1_ratio=1, conv2_ratio=1, conv3_ratio=0.01)[source]¶

Temporal Evaluation Model for Boundary Sensitive Network.

Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.

Code reference https://github.com/wzmsltw/BSN-boundary-sensitive-network

Parameters

tem_feat_dim (int) – Feature dimension.
tem_hidden_dim (int) – Hidden layer dimension.
tem_match_threshold (float) – Temporal evaluation match threshold.
loss_cls (dict) – Config for building loss. Default: dict(type='BinaryLogisticRegressionLoss').
loss_weight (float) – Weight term for action_loss. Default: 2.
output_dim (int) – Output dimension. Default: 3.
conv1_ratio (float) – Ratio of conv1 layer output. Default: 1.0.
conv2_ratio (float) – Ratio of conv2 layer output. Default: 1.0.
conv3_ratio (float) – Ratio of conv3 layer output. Default: 0.01.

forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶: Define the computation performed at every call.

forward_test(raw_feature, video_meta)[source]¶: Define the computation performed at every call when testing.

forward_train(raw_feature, label_action, label_start, label_end)[source]¶: Define the computation performed at every call when training.

generate_labels(gt_bbox)[source]¶: Generate training labels.

common¶

class mmaction.models.common.Conv2plus1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, norm_cfg={'type': 'BN3d'})[source]¶

(2+1)d Conv module for R(2+1)d backbone.

https://arxiv.org/pdf/1711.11248.pdf.

Parameters

in_channels (int) – Same as nn.Conv3d.
out_channels (int) – Same as nn.Conv3d.
kernel_size (int | tuple[int]) – Same as nn.Conv3d.
stride (int | tuple[int]) – Same as nn.Conv3d.
padding (int | tuple[int]) – Same as nn.Conv3d.
dilation (int | tuple[int]) – Same as nn.Conv3d.
groups (int) – Same as nn.Conv3d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.
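The (2+1)D factorization splits a t x d x d 3D convolution into a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution. The sketch below computes the intermediate channel count with the formula given in the referenced paper, chosen so that the factorized block has roughly as many weights as the full 3D convolution. The function names are hypothetical, and whether this module uses exactly this rounding is an assumption.

```python
def mid_channels_2plus1d(in_channels, out_channels, t=3, d=3):
    """Intermediate channels M = floor(t*d^2*N_in*N_out / (d^2*N_in + t*N_out))."""
    numerator = t * d * d * in_channels * out_channels
    denominator = d * d * in_channels + t * out_channels
    return numerator // denominator


def conv3d_weights(in_channels, out_channels, t=3, d=3):
    """Weight count of a full t x d x d 3D convolution (bias ignored)."""
    return in_channels * out_channels * t * d * d


def conv2plus1d_weights(in_channels, out_channels, t=3, d=3):
    """Weight count of the spatial (1 x d x d) plus temporal (t x 1 x 1) pair."""
    mid = mid_channels_2plus1d(in_channels, out_channels, t, d)
    spatial = in_channels * mid * d * d
    temporal = mid * out_channels * t
    return spatial + temporal


mid = mid_channels_2plus1d(64, 64)
full = conv3d_weights(64, 64)
factored = conv2plus1d_weights(64, 64)
```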

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The output of the module.
Return type: torch.Tensor

init_weights()[source]¶: Initialize the parameters from scratch.

class mmaction.models.common.ConvAudio(in_channels, out_channels, kernel_size, op='concat', stride=1, padding=0, dilation=1, groups=1, bias=False)[source]¶

Conv2d module for AudioResNet backbone.

https://arxiv.org/abs/2001.08740.

Parameters

in_channels (int) – Same as nn.Conv2d.
out_channels (int) – Same as nn.Conv2d.
kernel_size (int | tuple[int]) – Same as nn.Conv2d.
op (string) – Operation to merge the output of freq and time feature map. Choices are ‘sum’ and ‘concat’. Default: ‘concat’.
stride (int | tuple[int]) – Same as nn.Conv2d.
padding (int | tuple[int]) – Same as nn.Conv2d.
dilation (int | tuple[int]) – Same as nn.Conv2d.
groups (int) – Same as nn.Conv2d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The output of the module.
Return type: torch.Tensor

init_weights()[source]¶: Initialize the parameters from scratch.

class mmaction.models.common.LFB(lfb_prefix_path, max_num_sampled_feat=5, window_size=60, lfb_channels=2048, dataset_modes=('train', 'val'), device='gpu', lmdb_map_size=4000000000.0, construct_lmdb=True)[source]¶

Long-Term Feature Bank (LFB).

LFB is proposed in Long-Term Feature Banks for Detailed Video Understanding

The ROI features of videos are stored in the feature bank. The feature bank is generated by running inference with an LFB inference config.

Formally, LFB is a dict whose keys are video IDs and whose values are also dicts, keyed by timestamps in seconds.
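The nested layout described above can be pictured with a small stand-in. The video IDs, timestamps and two-element feature vectors below are invented for illustration, plain lists replace feature tensors, and features_at is a hypothetical helper rather than a method of this class.

```python
# Outer keys: video IDs. Inner keys: timestamps in seconds.
# Inner values: the ROI features stored for that second (lists stand in for tensors).
lfb = {
    'video_a': {
        901: [[0.1, 0.2], [0.3, 0.4]],   # two ROI features at second 901
        902: [[0.5, 0.6]],
    },
    'video_b': {
        15: [[0.7, 0.8]],
    },
}


def features_at(bank, video_id, timestamp):
    """Return the ROI features for one video and second, or [] if absent."""
    return bank.get(video_id, {}).get(timestamp, [])


roi_feats = features_at(lfb, 'video_a', 901)
missing = features_at(lfb, 'video_c', 0)
```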

Parameters

lfb_prefix_path (str) – The storage path of lfb.
max_num_sampled_feat (int) – The max number of sampled features. Default: 5.
window_size (int) – Window size of sampling long term feature. Default: 60.
lfb_channels (int) – Number of the channels of the features stored in LFB. Default: 2048.
dataset_modes (tuple[str] | str) – Load LFB of datasets with different modes, such as training, validation, testing datasets. If you don’t do cross validation during training, just load the training dataset i.e. setting dataset_modes = (‘train’). Default: (‘train’, ‘val’).
device (str) – Where to load lfb. Choices are ‘gpu’, ‘cpu’ and ‘lmdb’. A 1.65GB half-precision ava lfb (including training and validation) occupies about 2GB GPU memory. Default: ‘gpu’.
lmdb_map_size (int) – Map size of lmdb. Default: 4e9.
construct_lmdb (bool) – Whether to construct lmdb. If you have constructed lmdb of lfb, you can set to False to skip the construction. Default: True.

class mmaction.models.common.TAM(in_channels, num_segments, alpha=2, adaptive_kernel_size=3, beta=4, conv1d_kernel_size=3, adaptive_convolution_stride=1, adaptive_convolution_padding=1, init_std=0.001)[source]¶

Temporal Adaptive Module(TAM) for TANet.

This module is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Parameters

in_channels (int) – Channel num of input features.
num_segments (int) – Number of frame segments.
alpha (int) – `alpha` in the paper: the ratio of the intermediate channel number to the initial channel number in the global branch. Default: 2.
adaptive_kernel_size (int) – `K` in the paper: the size of the adaptive kernel in the global branch. Default: 3.
beta (int) – `beta` in the paper: controls the model complexity in the local branch. Default: 4.
conv1d_kernel_size (int) – Size of the convolution kernel of Conv1d in the local branch. Default: 3.
adaptive_convolution_stride (int) – The first dimension of strides in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
adaptive_convolution_padding (int) – The first dimension of paddings in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
init_std (float) – Standard deviation used for initialization of nn.Linear. Default: 0.001.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The output of the module.
Return type: torch.Tensor

init_weights()[source]¶: Initialize the parameters from scratch.

backbones¶

class mmaction.models.backbones.C3D(pretrained=None, style='pytorch', conv_cfg=None, norm_cfg=None, act_cfg=None, dropout_ratio=0.5, init_std=0.005)[source]¶

C3D backbone.

Parameters

pretrained (str | None) – Name of pretrained model.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
conv_cfg (dict | None) – Config dict for convolution layer. If set to None, it uses dict(type='Conv3d') to construct layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. The required key is type. Default: None.
act_cfg (dict | None) – Config dict for activation layer. If set to None, it uses dict(type='ReLU') to construct layers. Default: None.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Standard deviation used for initialization of fc layers. Default: 0.005.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data. The size of x is (num_batches, 3, 16, 112, 112).
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

init_weights()[source]¶: Initialize the parameters either from an existing checkpoint or from scratch.

class mmaction.models.backbones.MobileNetV2(pretrained=None, widen_factor=1.0, out_indices=(7), frozen_stages=- 1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU6'}, norm_eval=False, with_cp=False)[source]¶

MobileNetV2 backbone.

Parameters

pretrained (str | None) – Name of pretrained model. Default: None.
widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Default: 1.0.
out_indices (None or Sequence[int]) – Output from which stages. Default: (7, ).
frozen_stages (int) – Stages to be frozen (all param fixed). Default: -1, which means not freezing any parameters.
conv_cfg (dict) – Config dict for convolution layer. Default: dict(type=’Conv’).
norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type=’ReLU6’, inplace=True).
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

forward(x)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

make_layer(out_channels, num_blocks, stride, expand_ratio)[source]¶

Stack InvertedResidual blocks to build a layer for MobileNetV2.

Parameters

out_channels (int) – out_channels of block.
num_blocks (int) – number of blocks.
stride (int) – Stride of the first block.
expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio.

train(mode=True)[source]¶

Sets the module in training mode.

This has any effect only on certain modules. See documentations of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

Parameters: mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.
Returns: self
Return type: Module

class mmaction.models.backbones.MobileNetV2TSM(num_segments=8, is_shift=True, shift_div=8, **kwargs)[source]¶

MobileNetV2 backbone for TSM.

Parameters

num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to apply temporal shift to the layers. Default: True.
shift_div (int) – Number of div for shift. Default: 8.
**kwargs (keyword arguments, optional) – Arguments for MobileNetV2.

init_weights()[source]¶: Initialize the parameters either from an existing checkpoint or from scratch.

make_temporal_shift()[source]¶: Make temporal shift for some layers.
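As a rough picture of what a temporal shift with shift_div does, the sketch below follows the general TSM idea: one 1/shift_div slice of the channels takes its values from the next frame, a second slice takes them from the previous frame, and the rest stay in place, with zero padding at the sequence ends. It works on nested lists rather than tensors, temporal_shift is a hypothetical helper, and the exact channel ordering used by this backbone is an assumption.

```python
def temporal_shift(frames, shift_div=8):
    """frames: list of T frames, each a list of C channel values."""
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // shift_div           # channels per shifted slice
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1                    # first slice: read from the next frame
            elif c < 2 * fold:
                src = t - 1                    # second slice: read from the previous frame
            else:
                src = t                        # remaining channels are not shifted
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]     # out-of-range sources stay zero
    return out


# 3 frames, 8 channels; value t*10 + c makes the source frame easy to read off.
frames = [[t * 10 + c for c in range(8)] for t in range(3)]
shifted = temporal_shift(frames, shift_div=4)
```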

class mmaction.models.backbones.ResNet(depth, pretrained=None, torchvision_pretrain=True, in_channels=3, num_stages=4, out_indices=(3), strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), style='pytorch', frozen_stages=- 1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, partial_bn=False, with_cp=False)[source]¶

ResNet backbone.

Parameters

depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of the first block of each stage.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
dilations (Sequence[int]) – Dilation of each stage.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config for activation layers. Default: dict(type=’ReLU’, inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
partial_bn (bool) – Whether to use partial bn. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

init_weights()[source]¶: Initialize the parameters either from an existing checkpoint or from scratch.

train(mode=True)[source]¶: Set the optimization status when training.

class mmaction.models.backbones.ResNet2Plus1d(*args, **kwargs)[source]¶

ResNet (2+1)d backbone.

This model is proposed in A Closer Look at Spatiotemporal Convolutions for Action Recognition

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

class mmaction.models.backbones.ResNet3d(depth, pretrained, pretrained2d=True, in_channels=3, num_stages=4, base_channels=64, out_indices=(3), spatial_strides=(1, 2, 2, 2), temporal_strides=(1, 1, 1, 1), dilations=(1, 1, 1, 1), conv1_kernel=(5, 7, 7), conv1_stride_t=2, pool1_stride_t=2, with_pool2=True, style='pytorch', frozen_stages=- 1, inflate=(1, 1, 1, 1), inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, non_local=(0, 0, 0, 0), non_local_cfg={}, zero_init_residual=True, **kwargs)[source]¶

ResNet 3d backbone.

Parameters

depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
in_channels (int) – Channel num of input features. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (1, 2, 2, 2).
temporal_strides (Sequence[int]) – Temporal strides of residual blocks of each stage. Default: (1, 1, 1, 1).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (5, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 2.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 2.
with_pool2 (bool) – Whether to use pool2. Default: True.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
inflate (Sequence[int]) – Inflate Dims of each block. Default: (1, 1, 1, 1).
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
conv_cfg (dict) – Config for conv layers. The required key is type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
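A config for this backbone might look like the dict below, which repeats the documented defaults. The type='ResNet3d' key follows the dict(type=...) convention used for the other configs on this page and is an assumption about how the backbone is registered. check_per_stage_lengths is a hypothetical helper that checks the per-stage tuples against num_stages, which seems a sensible consistency check but is not confirmed here.

```python
backbone_cfg = dict(
    type='ResNet3d',                     # assumed registry name
    depth=50,
    pretrained=None,
    num_stages=4,
    spatial_strides=(1, 2, 2, 2),
    temporal_strides=(1, 1, 1, 1),
    dilations=(1, 1, 1, 1),
    inflate=(1, 1, 1, 1),
    non_local=(0, 0, 0, 0),
    conv_cfg=dict(type='Conv3d'),
    norm_cfg=dict(type='BN3d', requires_grad=True),
    act_cfg=dict(type='ReLU', inplace=True),
)

# Arguments documented as one entry per stage.
PER_STAGE_KEYS = ('spatial_strides', 'temporal_strides', 'dilations',
                  'inflate', 'non_local')


def check_per_stage_lengths(cfg):
    """Raise ValueError if a per-stage tuple does not have num_stages entries."""
    num_stages = cfg['num_stages']
    bad = [k for k in PER_STAGE_KEYS if k in cfg and len(cfg[k]) != num_stages]
    if bad:
        raise ValueError(f'expected {num_stages} entries in: {bad}')


check_per_stage_lengths(backbone_cfg)    # passes for the defaults above
```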

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

static make_res_layer(block, inplanes, planes, blocks, spatial_stride=1, temporal_stride=1, dilation=1, style='pytorch', inflate=1, inflate_style='3x1x1', non_local=0, non_local_cfg={}, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶

Build residual layer for ResNet3D.

Parameters

block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
spatial_stride (int | Sequence[int]) – Spatial strides in residual and conv layers. Default: 1.
temporal_stride (int | Sequence[int]) – Temporal strides in residual and conv layers. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
style (str) – pytorch or caffe. If set to pytorch, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.
inflate (int | Sequence[int]) – Determine whether to inflate for each block. Default: 1.
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
non_local (int | Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: 0.
non_local_cfg (dict) – Config for non-local module. Default: dict().
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activation layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

Return type

nn.Module

train(mode=True)[source]¶: Set the optimization status when training.

class mmaction.models.backbones.ResNet3dCSN(depth, pretrained, temporal_strides=(1, 2, 2, 2), conv1_kernel=(3, 7, 7), conv1_stride_t=1, pool1_stride_t=1, norm_cfg={'eps': 0.001, 'requires_grad': True, 'type': 'BN3d'}, inflate_style='3x3x3', bottleneck_mode='ir', bn_frozen=False, **kwargs)[source]¶

ResNet backbone for CSN.

Parameters

depth (int) – Depth of ResNetCSN, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
temporal_strides (tuple[int]) – Temporal strides of residual blocks of each stage. Default: (1, 2, 2, 2).
conv1_kernel (tuple[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN3d’, requires_grad=True, eps=1e-3).
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x3x3’.
bottleneck_mode (str) –
Determine which ways to factorize a 3D bottleneck block using channel-separated convolutional networks.

If set to ‘ip’, it will replace the 3x3x3 conv2 layer with a 1x1x1 traditional convolution and a 3x3x3 depthwise convolution, i.e., Interaction-preserved channel-separated bottleneck block. If set to ‘ir’, it will replace the 3x3x3 conv2 layer with a 3x3x3 depthwise convolution, which is derived from preserved bottleneck block by removing the extra 1x1x1 convolution, i.e., Interaction-reduced channel-separated bottleneck block.

Default: ‘ir’.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
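
To make the difference between the two bottleneck modes concrete, the sketch below counts the weights that stand in for the 3x3x3 conv2 layer in each case. The helper names and the channel width of 64 are illustrative assumptions and are not part of mmaction; biases are ignored.

```python
def conv3d_weights(in_channels, out_channels, kernel, groups=1):
    """Number of weights in a 3D convolution (bias ignored)."""
    kt, kh, kw = kernel
    return (in_channels // groups) * out_channels * kt * kh * kw


def conv2_weights(channels, bottleneck_mode):
    """Weights used in place of conv2 for a given bottleneck mode."""
    # Depthwise 3x3x3 conv: one filter per channel (groups == channels).
    depthwise = conv3d_weights(channels, channels, (3, 3, 3), groups=channels)
    if bottleneck_mode == 'ir':
        # Interaction-reduced: depthwise conv only.
        return depthwise
    if bottleneck_mode == 'ip':
        # Interaction-preserved: extra 1x1x1 conv plus the depthwise conv.
        pointwise = conv3d_weights(channels, channels, (1, 1, 1))
        return pointwise + depthwise
    raise ValueError(f'unknown bottleneck_mode: {bottleneck_mode}')


channels = 64  # hypothetical bottleneck width, chosen for illustration
standard = conv3d_weights(channels, channels, (3, 3, 3))
print(standard, conv2_weights(channels, 'ip'), conv2_weights(channels, 'ir'))
```

Under these assumptions, both channel-separated variants use far fewer weights than a dense 3x3x3 convolution, and ‘ir’ is the smaller of the two because it drops the 1x1x1 convolution.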

train(mode=True)[source]¶: Set the optimization status when training.

class mmaction.models.backbones.ResNet3dLayer(depth, pretrained, pretrained2d=True, stage=3, base_channels=64, spatial_stride=2, temporal_stride=1, dilation=1, style='pytorch', all_frozen=False, inflate=1, inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶

ResNet 3d Layer.

Parameters

depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
stage (int) – The index of Resnet stage. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
spatial_stride (int) – The 1st res block’s spatial stride. Default: 2.
temporal_stride (int) – The 1st res block’s temporal stride. Default: 1.
dilation (int) – The dilation. Default: 1.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
all_frozen (bool) – Whether to freeze all modules in the layer. Default: False.
inflate (int) – Inflate Dims of each block. Default: 1.
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
conv_cfg (dict) – Config for conv layers. The required key is type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

train(mode=True)[source]¶: Set the optimization status when training.

class mmaction.models.backbones.ResNet3dSlowFast(pretrained, resample_rate=8, speed_ratio=8, channel_ratio=8, slow_pathway={'conv1_kernel': (1, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'dilations': (1, 1, 1, 1), 'inflate': (0, 0, 1, 1), 'lateral': True, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, fast_pathway={'base_channels': 8, 'conv1_kernel': (5, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'lateral': False, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'})[source]¶

Slowfast backbone.

This module is proposed in SlowFast Networks for Video Recognition

Parameters

pretrained (str) – The file path to a pretrained model.
resample_rate (int) – A large temporal stride resample_rate on input frames. The actual resample rate is calculated by multiplying the interval in SampleFrames in the pipeline by resample_rate. It is equivalent to the \(\tau\) in the paper, i.e. the backbone processes only one out of resample_rate * interval frames. Default: 8.
speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Default: 8.
channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Default: 8.
slow_pathway (dict) –
Configuration of slow branch. It should contain the arguments needed to build the specific type of pathway, plus: type (str): type of backbone the pathway is based on. lateral (bool): whether to build lateral connections for the pathway. Default:
```
dict(type='resnet3d',
lateral=True, depth=50, pretrained=None,
conv1_kernel=(1, 7, 7), dilations=(1, 1, 1, 1),
conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1))
```

fast_pathway (dict) –

Configuration of fast branch, similar to slow_pathway. Default:

dict(type='resnet3d',
lateral=False, depth=50, pretrained=None, base_channels=8,
conv1_kernel=(5, 7, 7), conv1_stride_t=1, pool1_stride_t=1)
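
As a rough illustration of how resample_rate, speed_ratio and channel_ratio relate, the sketch below picks frame indices for each pathway by plain strided subsampling. This is an assumption made for clarity: the function name is hypothetical, the backbone itself may resample differently (for example by interpolation), and the slow-pathway base width of 64 is assumed rather than stated in the signature above.

```python
def pathway_frame_indices(num_frames, resample_rate=8, speed_ratio=8):
    """Return (slow, fast) frame indices under simple strided subsampling."""
    # Slow pathway: keep one frame out of every `resample_rate`.
    slow = list(range(0, num_frames, resample_rate))
    # Fast pathway: `speed_ratio` times denser in time than the slow pathway.
    fast_stride = max(resample_rate // speed_ratio, 1)
    fast = list(range(0, num_frames, fast_stride))
    return slow, fast


slow, fast = pathway_frame_indices(32)
print(len(slow), len(fast))

# The fast pathway is narrower by `channel_ratio`.
slow_base_channels = 64  # assumed width of the slow pathway stem
channel_ratio = 8
fast_base_channels = slow_base_channels // channel_ratio
print(fast_base_channels)
```

With the defaults, a 32-frame clip gives the fast pathway eight times as many frames as the slow pathway, and the assumed slow width of 64 divided by channel_ratio yields the base_channels=8 shown in the fast_pathway default.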

forward(x)[source]¶

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.

Returns

The feature of the input samples extracted by the backbone.

Return type

tuple[torch.Tensor]

init_weights(pretrained=None)[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

class mmaction.models.backbones.ResNet3dSlowOnly(*args, lateral=False, conv1_kernel=(1, 7, 7), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1), with_pool2=False, **kwargs)[source]¶

SlowOnly backbone based on ResNet3dPathway.

Parameters

*args (arguments) – Arguments same as ResNet3dPathway.
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (1, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
inflate (Sequence[int]) – Inflate Dims of each block. Default: (0, 0, 1, 1).
**kwargs (keyword arguments) – Keywords arguments for ResNet3dPathway.

class mmaction.models.backbones.ResNetAudio(depth, pretrained, in_channels=1, num_stages=4, base_channels=32, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), conv1_kernel=9, conv1_stride=1, frozen_stages=- 1, factorize=(1, 1, 0, 0), norm_eval=False, with_cp=False, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, zero_init_residual=True)[source]¶

ResNet 2d audio backbone. Reference: https://arxiv.org/abs/2001.08740.

Parameters

depth (int) – Depth of resnet, from {50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
in_channels (int) – Channel num of input features. Default: 1.
base_channels (int) – Channel num of stem output features. Default: 32.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of residual blocks of each stage. Default: (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (int) – Kernel size of the first conv layer. Default: 9.
conv1_stride (int | tuple[int]) – Stride of the first conv layer. Default: 1.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
factorize (Sequence[int]) – factorize Dims of each block for audio. Default: (1, 1, 0, 0).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

make_res_layer(block, inplanes, planes, blocks, stride=1, dilation=1, factorize=1, norm_cfg=None, with_cp=False)[source]¶

Build residual layer for ResNetAudio.

Parameters

block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
stride (int) – Stride of the residual layer. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
factorize (int | Sequence[int]) – Determine whether to factorize for each block. Default: 1.
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: None.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

train(mode=True)[source]¶: Set the optimization status when training.

class mmaction.models.backbones.ResNetTIN(depth, num_segments=8, is_tin=True, shift_div=4, **kwargs)[source]¶

ResNet backbone for TIN.

Parameters

depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments. Default: 8.
is_tin (bool) – Whether to apply temporal interlace. Default: True.
shift_div (int) – Number of division parts for shift. Default: 4.
kwargs (dict, optional) – Arguments for ResNet.

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_interlace()[source]¶: Make temporal interlace for some layers.

class mmaction.models.backbones.ResNetTSM(depth, num_segments=8, is_shift=True, non_local=(0, 0, 0, 0), non_local_cfg={}, shift_div=8, shift_place='blockres', temporal_pool=False, **kwargs)[source]¶

ResNet backbone for TSM.

Parameters

num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to make temporal shift in resnet layers. Default: True.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
shift_div (int) – Number of div for shift. Default: 8.
shift_place (str) – Places in resnet layers for shift, which is chosen from [‘block’, ‘blockres’]. If set to ‘block’, it will apply temporal shift to all child blocks in each resnet layer. If set to ‘blockres’, it will apply temporal shift to each conv1 layer of all child blocks in each resnet layer. Default: ‘blockres’.
temporal_pool (bool) – Whether to add temporal pooling. Default: False.
**kwargs (keyword arguments, optional) – Arguments for ResNet.
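
The role of shift_div is easier to see with a small example. The following is a minimal pure-Python sketch of the temporal shift idea, assuming the common formulation in which the first 1/shift_div of the channels take their values from the next frame, the second 1/shift_div from the previous frame, and the rest stay in place. The function name and the list-of-lists layout are illustrative assumptions, not mmaction's implementation.

```python
def temporal_shift(frames, shift_div=8):
    """Shift part of the channels along time.

    frames: list of T frames, each a list of C channel values.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // shift_div  # channels moved in each direction
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1  # take the value from the next frame
            elif c < 2 * fold:
                src = t - 1  # take the value from the previous frame
            else:
                src = t      # unshifted channels
            # Positions with no source frame stay zero-padded.
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]
    return out


# Three frames with eight channels; value 10 * t + c identifies its origin.
frames = [[10 * t + c for c in range(8)] for t in range(3)]
shifted = temporal_shift(frames, shift_div=4)
print(shifted[1])
```

A larger shift_div moves a smaller fraction of the channels, so less information is exchanged between neighbouring frames.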

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

make_temporal_pool()[source]¶: Make temporal pooling between layer1 and layer2, using a 3D max pooling layer.

make_temporal_shift()[source]¶: Make temporal shift for some layers.

class mmaction.models.backbones.TANet(depth, num_segments, tam_cfg={}, **kwargs)[source]¶

Temporal Adaptive Network (TANet) backbone.

This backbone is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION

Embedding the temporal adaptive module (TAM) into ResNet to instantiate TANet.

Parameters

depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments.
tam_cfg (dict | None) – Config for temporal adaptive module (TAM). Default: dict().
**kwargs (keyword arguments, optional) – Arguments for ResNet except `depth`.

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

make_tam_modeling()[source]¶: Replace ResNet-Block with TA-Block.

class mmaction.models.backbones.X3D(gamma_w=1.0, gamma_b=1.0, gamma_d=1.0, pretrained=None, in_channels=3, num_stages=4, spatial_strides=(2, 2, 2, 2), frozen_stages=- 1, se_style='half', se_ratio=0.0625, use_swish=True, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶

X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.

Parameters

gamma_w (float) – Global channel width expansion factor. Default: 1.
gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.
gamma_d (float) – Network depth expansion factor. Default: 1.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).
frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict) – Config for conv layers. The required key is type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
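
For readers unfamiliar with the activation selected by use_swish, here is a minimal scalar sketch, assuming the usual definition swish(x) = x * sigmoid(x). The function names are illustrative; the backbone applies the activation to tensors rather than Python floats.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function for a Python float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x):
    """Swish activation: the input scaled by its own sigmoid."""
    return x * sigmoid(x)


print(swish(0.0), swish(1.0))
```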

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The feature of the input samples extracted by the backbone.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

make_res_layer(block, layer_inplanes, inplanes, planes, blocks, spatial_stride=1, se_style='half', se_ratio=None, use_swish=True, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶

Build residual layer for ResNet3D.

Parameters

block (nn.Module) – Residual module to be built.
layer_inplanes (int) – Number of channels for the input feature of the res layer.
inplanes (int) – Number of channels for the input feature in each block, which equals to base_channels * gamma_w.
planes (int) – Number of channels for the output feature in each block, which equals base_channels * gamma_w * gamma_b.
blocks (int) – Number of residual blocks.
spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activate layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

Returns

A residual layer for the given config.

Return type

nn.Module
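
The relationship between inplanes, planes and the expansion factors described above can be written out directly. This is a sketch under assumptions: the helper name and the sample values are hypothetical, and any rounding of channel counts that the backbone may perform is omitted.

```python
def x3d_block_channels(base_channels, gamma_w, gamma_b):
    """Return (inplanes, planes) from the width expansion factors."""
    inplanes = base_channels * gamma_w   # block input width
    planes = inplanes * gamma_b          # block output width as described above
    return inplanes, planes


# Hypothetical values chosen only to illustrate the arithmetic.
inplanes, planes = x3d_block_channels(base_channels=24, gamma_w=2.0, gamma_b=2.25)
print(inplanes, planes)
```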

train(mode=True)[source]¶: Set the optimization status when training.

heads¶

class mmaction.models.heads.AudioTSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶

Classification head for TSN on audio.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The classification scores for input samples.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.heads.BBoxHeadAVA(temporal_pool_type='avg', spatial_pool_type='max', in_channels=2048, num_classes=81, dropout_ratio=0, dropout_before_pool=True, topk=(3, 5), multilabel=True)[source]¶

Simplest RoI head, with only two fc layers for classification and regression respectively.

Parameters

temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
in_channels (int) – The number of input channels. Default: 2048.
num_classes (int) – The number of classes. Default: 81.
dropout_ratio (float) – A float in [0, 1], indicates the dropout_ratio. Default: 0.
dropout_before_pool (bool) – Dropout Feature before spatial temporal pooling. Default: True.
topk (int | tuple[int]) – Parameter for evaluating multilabel accuracy. Default: (3, 5).
multilabel (bool) – Whether used for a multilabel task. Default: True. (Only support multilabel == True now).

forward(x)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

recall_prec(pred_vec, target_vec)[source]¶

Parameters

pred_vec (tensor[N x C]) – each element is either 0 or 1
target_vec (tensor[N x C]) – each element is either 0 or 1
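
Since the entry above only lists the inputs, the sketch below shows one plausible way to compute recall and precision from two binary N x C matrices: per-sample values averaged over the batch. The aggregation, the handling of empty rows and the function name are assumptions for illustration and may differ from what the head actually does.

```python
def recall_prec_sketch(pred_vec, target_vec):
    """Mean per-sample recall and precision for binary N x C label matrices."""
    recalls, precisions = [], []
    for pred_row, target_row in zip(pred_vec, target_vec):
        # Labels that are both predicted and present in the target.
        correct = sum(p & t for p, t in zip(pred_row, target_row))
        num_target = sum(target_row)
        num_pred = sum(pred_row)
        recalls.append(correct / num_target if num_target else 0.0)
        precisions.append(correct / num_pred if num_pred else 0.0)
    n = len(recalls)
    return sum(recalls) / n, sum(precisions) / n


pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
recall, prec = recall_prec_sketch(pred, target)
print(recall, prec)
```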

class mmaction.models.heads.BaseHead(num_classes, in_channels, loss_cls={'loss_weight': 1.0, 'type': 'CrossEntropyLoss'}, multi_class=False, label_smooth_eps=0.0)[source]¶

Base class for head.

All heads should subclass it. All subclasses should override:

- init_weights, which initializes the weights in some modules.
- forward, which supports the forward pass for both training and testing.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’, loss_weight=1.0).
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: arxiv.org/abs/1906.02629. Default: 0.

abstract forward(x)[source]¶: Defines the computation performed at every call.

abstract init_weights()[source]¶: Initiate the parameters either from existing checkpoint or from scratch.

loss(cls_score, labels, **kwargs)[source]¶

Calculate the loss given output cls_score, target labels.

Parameters

cls_score (torch.Tensor) – The output of the model.
labels (torch.Tensor) – The target output of the model.

Returns

A dict containing the field ‘loss_cls’ (mandatory) and ‘top1_acc’, ‘top5_acc’ (optional).

Return type

dict
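
To show what label_smooth_eps does to the target before the loss is computed, here is a minimal sketch assuming the standard label-smoothing formulation, in which each target is mixed with a uniform distribution over the classes. The function name is hypothetical and the head may apply smoothing differently.

```python
def smooth_labels(one_hot, eps):
    """Blend a one-hot target with a uniform distribution over classes."""
    num_classes = len(one_hot)
    return [(1.0 - eps) * y + eps / num_classes for y in one_hot]


smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], eps=0.1)
print(smoothed)
```

With eps set to 0, the default, the target is returned unchanged.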

class mmaction.models.heads.FBOHead(lfb_cfg, fbo_cfg, temporal_pool_type='avg', spatial_pool_type='max')[source]¶

Feature Bank Operator Head.

Add feature bank operator for the spatiotemporal detection model to fuse short-term features and long-term features.

Parameters

lfb_cfg (Dict) – The config dict for LFB which is used to sample long-term features.
fbo_cfg (Dict) – The config dict for feature bank operator (FBO). The type of fbo is also in the config dict and supported fbo type is fbo_dict.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.

forward(x, rois, img_metas)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_weights(pretrained=None)[source]¶

Initialize the weights in the module.

Parameters: pretrained (str, optional) – Path to pre-trained weights. Default: None.

sample_lfb(rois, img_metas)[source]¶: Sample long-term features for each ROI feature.

class mmaction.models.heads.I3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, **kwargs)[source]¶

Classification head for I3D.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The classification scores for input samples.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.heads.LFBInferHead(lfb_prefix_path, dataset_mode='train', use_half_precision=True, temporal_pool_type='avg', spatial_pool_type='max')[source]¶

Long-Term Feature Bank Infer Head.

This head is used to derive and save the LFB without affecting the input.

Parameters

lfb_prefix_path (str) – The prefix path to store the lfb.
dataset_mode (str, optional) – Which dataset to be inferred. Choices are ‘train’, ‘val’ or ‘test’. Default: ‘train’.
use_half_precision (bool, optional) – Whether to store the half-precision roi features. Default: True.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.

forward(x, rois, img_metas)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmaction.models.heads.SSNHead(dropout_ratio=0.8, in_channels=1024, num_classes=20, consensus={'num_seg': (2, 5, 2), 'standalong_classifier': True, 'stpp_cfg': (1, 1, 1), 'type': 'STPPTrain'}, use_regression=True, init_std=0.001)[source]¶

The classification head for SSN.

Parameters

dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
in_channels (int) – Number of channels for input data. Default: 1024.
num_classes (int) – Number of classes to be classified. Default: 20.
consensus (dict) – Config of segmental consensus.
use_regression (bool) – Whether to perform regression or not. Default: True.
init_std (float) – Std value for Initiation. Default: 0.001.

forward(x, test_mode=False)[source]¶: Defines the computation performed at every call.

init_weights()[source]¶: Initiate the parameters from scratch.

prepare_test_fc(stpp_feat_multiplier)[source]¶

Reorganize the shape of fully connected layer at testing, in order to improve testing efficiency.

Parameters: stpp_feat_multiplier (int) – Total number of parts.
Returns: Whether the shape transformation is ready for testing.
Return type: bool

class mmaction.models.heads.SlowFastHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.8, init_std=0.01, **kwargs)[source]¶

The classification head for SlowFast.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The classification scores for input samples.
Return type: torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.heads.TPNHead(*args, **kwargs)[source]¶

Class head for TPN.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for Initiation. Default: 0.01.
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: https://arxiv.org/abs/1906.02629. Default: 0.

forward(x, num_segs=None, fcn_test=False)[source]¶

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.
num_segs (int | None) – Number of segments into which a video is divided. Default: None.
fcn_test (bool) – Whether to apply full convolution (fcn) testing. Default: False.

Returns

The classification scores for input samples.

Return type

torch.Tensor

class mmaction.models.heads.TRNHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', relation_type='TRNMultiScale', hidden_dim=256, dropout_ratio=0.8, init_std=0.001, **kwargs)[source]¶

Class head for TRN.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
relation_type (str) – The relation module type. Choices are ‘TRN’ or ‘TRNMultiScale’. Default: ‘TRNMultiScale’.
hidden_dim (int) – The dimension of hidden layer of MLP in relation module. Default: 256.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for Initiation. Default: 0.001.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs)[source]¶

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.
num_segs (int) – Unused in TRNHead. By default, num_segs is equal to clip_len * num_clips * num_crops, which is generated automatically in the Recognizer forward phase and is not needed by TRN models. TRN models instead use self.num_segments, a hyperparameter set when the head is built.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.heads.TSMHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.8, init_std=0.001, is_shift=True, temporal_pool=False, **kwargs)[source]¶

Class head for TSM.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for Initiation. Default: 0.001.
is_shift (bool) – Indicating whether the feature is shifted. Default: True.
temporal_pool (bool) – Indicating whether feature is temporal pooled. Default: False.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
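
Because num_segments, is_shift and temporal_pool appear on both ResNetTSM and TSMHead, a config usually keeps them in sync. The fragment below is a sketch only: it assumes the registry `type` strings match the class names, that the two dicts sit under `backbone` and `cls_head` keys, that a depth-50 backbone outputs 2048 channels, and that there are 400 classes. None of these are stated in this reference.

```python
num_segments = 8  # shared by backbone and head

model = dict(
    backbone=dict(
        type='ResNetTSM',  # assumed registry name
        depth=50,
        num_segments=num_segments,
        is_shift=True,
        shift_div=8,
        shift_place='blockres',
        temporal_pool=False),
    cls_head=dict(
        type='TSMHead',  # assumed registry name
        num_classes=400,   # hypothetical number of classes
        in_channels=2048,  # assumed output width of a depth-50 backbone
        num_segments=num_segments,
        dropout_ratio=0.8,
        init_std=0.001,
        is_shift=True,
        temporal_pool=False))
```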

forward(x, num_segs)[source]¶

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.
num_segs (int) – Unused in TSMHead. By default, num_segs is equal to clip_len * num_clips * num_crops, which is generated automatically in the Recognizer forward phase and is not needed by TSM models. TSM models instead use self.num_segments, a hyperparameter set when the head is built.

Returns

The classification scores for input samples.

Return type

torch.Tensor

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.heads.TSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶

Class head for TSN.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for Initiation. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.

forward(x, num_segs)[source]¶

Defines the computation performed at every call.

Parameters

x (torch.Tensor) – The input data.
num_segs (int) – Number of segments into which a video is divided.

Returns

The classification scores for input samples.

Return type

torch.Tensor
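
The default consensus, dict(type='AvgConsensus', dim=1), can be pictured as averaging per-segment class scores within each video. The sketch below assumes the scores arrive as a flat list of N * num_segs rows grouped by video; the function name and the list-based layout are illustrative and stand in for the tensor operations the head performs.

```python
def average_consensus(segment_scores, num_segs):
    """Average class scores over the segments of each video.

    segment_scores: list of N * num_segs rows, each with num_classes scores.
    Returns a list of N rows of averaged scores.
    """
    if len(segment_scores) % num_segs != 0:
        raise ValueError('number of rows must be a multiple of num_segs')
    num_classes = len(segment_scores[0])
    video_scores = []
    for start in range(0, len(segment_scores), num_segs):
        group = segment_scores[start:start + num_segs]
        video_scores.append(
            [sum(row[c] for row in group) / num_segs for c in range(num_classes)])
    return video_scores


# Two videos, two segments each, two classes.
scores = [[1, 3], [3, 5], [0, 2], [2, 4]]
print(average_consensus(scores, num_segs=2))
```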

init_weights()[source]¶: Initiate the parameters from scratch.

class mmaction.models.heads.X3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, fc1_bias=False)[source]¶

Classification head for X3D.

Parameters

num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization. Default: 0.01.
fc1_bias (bool) – If the first fc layer has bias. Default: False.

forward(x)[source]¶

Defines the computation performed at every call.

Parameters: x (torch.Tensor) – The input data.
Returns: The classification scores for input samples.
Return type: torch.Tensor

init_weights()[source]¶: Initialize the parameters from scratch.

necks¶

class mmaction.models.necks.TPN(in_channels, out_channels, spatial_modulation_cfg=None, temporal_modulation_cfg=None, upsample_cfg=None, downsample_cfg=None, level_fusion_cfg=None, aux_head_cfg=None, flow_type='cascade')[source]¶

TPN neck.

This module is proposed in Temporal Pyramid Network for Action Recognition

Parameters

in_channels (tuple[int]) – Channel numbers of input features tuple.
out_channels (int) – Channel number of output feature.
spatial_modulation_cfg (dict | None) – Config for spatial modulation layers. Required keys are in_channels and out_channels. Default: None.
temporal_modulation_cfg (dict | None) – Config for temporal modulation layers. Default: None.
upsample_cfg (dict | None) – Config for upsample layers. The keys are the same as those in nn.Upsample. Default: None.
downsample_cfg (dict | None) – Config for downsample layers. Default: None.
level_fusion_cfg (dict | None) – Config for level fusion layers. Required keys are ‘in_channels’, ‘mid_channels’, ‘out_channels’. Default: None.
aux_head_cfg (dict | None) – Config for aux head layers. Required keys are ‘out_channels’. Default: None.
flow_type (str) – Flow type to combine the features. Options are ‘cascade’ and ‘parallel’. Default: ‘cascade’.

forward(x, target=None)[source]¶

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

losses¶

class mmaction.models.losses.BCELossWithLogits(loss_weight=1.0, class_weight=None)[source]¶

Binary Cross Entropy Loss with logits.

Parameters

loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.

class mmaction.models.losses.BMNLoss[source]¶

BMN Loss.

From paper https://arxiv.org/abs/1907.09702, code https://github.com/JJBOY/BMN-Boundary-Matching-Network. It will calculate loss for BMN Model. This loss is a weighted sum of:

1) temporal evaluation loss based on confidence score of start and end positions.
2) proposal evaluation regression loss based on confidence scores of candidate proposals.
3) proposal evaluation classification loss based on classification results of candidate proposals.

forward(pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, bm_mask, weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0)[source]¶

Calculate Boundary Matching Network Loss.

Parameters

pred_bm (torch.Tensor) – Predicted confidence score for boundary matching map.
pred_start (torch.Tensor) – Predicted confidence score for start.
pred_end (torch.Tensor) – Predicted confidence score for end.
gt_iou_map (torch.Tensor) – Groundtruth score for boundary matching map.
gt_start (torch.Tensor) – Groundtruth temporal_iou score for start.
gt_end (torch.Tensor) – Groundtruth temporal_iou score for end.
bm_mask (torch.Tensor) – Boundary-Matching mask.
weight_tem (float) – Weight for tem loss. Default: 1.0.
weight_pem_reg (float) – Weight for pem regression loss. Default: 10.0.
weight_pem_cls (float) – Weight for pem classification loss. Default: 1.0.

Returns

(loss, tem_loss, pem_reg_loss, pem_cls_loss). Loss is the bmn loss, tem_loss is the temporal evaluation loss, pem_reg_loss is the proposal evaluation regression loss, pem_cls_loss is the proposal evaluation classification loss.

Return type

tuple([torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor])
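
How the three terms and their weights might combine into the returned tuple can be sketched as follows. The helper name combine_bmn_losses is hypothetical and the inputs are plain floats rather than tensors; only the weighting scheme and the default weights are taken from the signature above.

```python
def combine_bmn_losses(tem_loss, pem_reg_loss, pem_cls_loss,
                       weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0):
    """Return (loss, tem_loss, pem_reg_loss, pem_cls_loss).

    loss is the weighted sum of the three component losses.
    """
    loss = (weight_tem * tem_loss
            + weight_pem_reg * pem_reg_loss
            + weight_pem_cls * pem_cls_loss)
    return loss, tem_loss, pem_reg_loss, pem_cls_loss


total, tem, reg, cls = combine_bmn_losses(0.5, 0.25, 1.0)
print(total)
```

With the default weights, the regression term contributes ten times its raw value, so it tends to dominate unless it is much smaller than the other two.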

static pem_cls_loss(pred_score, gt_iou_map, mask, threshold=0.9, ratio_range=(1.05, 21), eps=1e-05)[source]¶

Calculate Proposal Evaluation Module Classification Loss.

Parameters

pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
threshold (float) – Threshold of temporal_iou for positive instances. Default: 0.9.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)
eps (float) – Epsilon for small value. Default: 1e-5

Returns

Proposal evaluation classification loss.

Return type

torch.Tensor

static pem_reg_loss(pred_score, gt_iou_map, mask, high_temporal_iou_threshold=0.7, low_temporal_iou_threshold=0.3)[source]¶

Calculate Proposal Evaluation Module Regression Loss.

Parameters

pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
high_temporal_iou_threshold (float) – Higher threshold of temporal_iou. Default: 0.7.
low_temporal_iou_threshold (float) – Lower threshold of temporal_iou. Default: 0.3.

Returns

Proposal evaluation regression loss.

Return type

torch.Tensor

static tem_loss(pred_start, pred_end, gt_start, gt_end)[source]¶

Calculate Temporal Evaluation Module Loss.

This function calculates the binary_logistic_regression_loss for start and end respectively and returns the sum of their losses.

Parameters

pred_start (torch.Tensor) – Predicted start score by BMN model.
pred_end (torch.Tensor) – Predicted end score by BMN model.
gt_start (torch.Tensor) – Groundtruth confidence score for start.
gt_end (torch.Tensor) – Groundtruth confidence score for end.

Returns

Returned binary logistic loss.

Return type

torch.Tensor

class mmaction.models.losses.BaseWeightedLoss(loss_weight=1.0)[source]¶

Base class for loss.

All subclasses should override the _forward() method, which returns the plain loss without loss weights applied.

Parameters: loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.

forward(*args, **kwargs)[source]¶

Defines the computation performed at every call.

Parameters

*args – The positional arguments for the corresponding loss.
**kwargs – The keyword arguments for the corresponding loss.

Returns

The calculated loss.

Return type

torch.Tensor

class mmaction.models.losses.BinaryLogisticRegressionLoss[source]¶

Binary Logistic Regression Loss.

It will calculate binary logistic regression loss given reg_score and label.

forward(reg_score, label, threshold=0.5, ratio_range=(1.05, 21), eps=1e-05)[source]¶

Calculate Binary Logistic Regression Loss.

Parameters

reg_score (torch.Tensor) – Predicted score by model.
label (torch.Tensor) – Groundtruth labels.
threshold (float) – Threshold for positive instances. Default: 0.5.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)
eps (float) – Epsilon for small value. Default: 1e-5.

Returns

Returned binary logistic loss.

Return type

torch.Tensor
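
One plausible reading of how threshold, ratio_range and eps interact is sketched below: labels above the threshold count as positives, the ratio of all entries to positives is clipped to ratio_range, and that ratio rebalances the positive and negative log terms. This is an assumption-based sketch on plain Python lists, not a copy of the library code, and the coefficient formulas are part of that assumption.

```python
import math


def binary_logistic_regression_loss(reg_score, label, threshold=0.5,
                                    ratio_range=(1.05, 21), eps=1e-5):
    """Class-balanced binary logistic loss over flat lists of floats."""
    # 1.0 for positive entries, 0.0 for negative entries.
    pmask = [1.0 if y > threshold else 0.0 for y in label]
    num_positive = max(sum(pmask), 1.0)
    # Ratio of all entries to positives, clipped to ratio_range.
    ratio = len(label) / num_positive
    ratio = min(max(ratio, ratio_range[0]), ratio_range[1])
    coef_neg = 0.5 * ratio / (ratio - 1.0)  # weight of negative entries
    coef_pos = 0.5 * ratio                  # weight of positive entries
    total = 0.0
    for p, m in zip(reg_score, pmask):
        total += coef_pos * m * math.log(p + eps)
        total += coef_neg * (1.0 - m) * math.log(1.0 - p + eps)
    return -total / len(label)


label = [1.0, 0.0, 0.0, 0.0]
print(binary_logistic_regression_loss([0.9, 0.1, 0.1, 0.1], label))
print(binary_logistic_regression_loss([0.1, 0.9, 0.9, 0.9], label))
```

The first call scores the positives high and the negatives low, so it should print the smaller of the two values.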

class mmaction.models.losses.CrossEntropyLoss(loss_weight=1.0, class_weight=None)[source]¶

Cross Entropy Loss.

Support two kinds of labels and their corresponding loss type. The loss type is detected from the shapes of cls_score and label.

1) Hard label: This label is an integer array and all of the elements are in the range [0, num_classes - 1]. This label’s shape should be cls_score’s shape with the num_classes dimension removed.
2) Soft label (probability distribution over classes): This label is a probability distribution and all of the elements are in the range [0, 1]. This label’s shape must be the same as cls_score. For now, only 2-dim soft label is supported.

Parameters

loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
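
The difference between the two label kinds can be sketched in plain Python. Here an int target stands in for a hard label and a list of probabilities for a soft label, which replaces the shape check described above. The function names are hypothetical and class_weight is ignored, so treat this as an illustration rather than the library's code.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of class scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]


def cross_entropy(cls_score, label):
    """Mean cross entropy for hard (int) or soft (list of floats) labels."""
    losses = []
    for scores, target in zip(cls_score, label):
        log_probs = log_softmax(scores)
        if isinstance(target, int):
            # Hard label: pick the log-probability of the target class.
            losses.append(-log_probs[target])
        else:
            # Soft label: expectation of -log p under the target distribution.
            if len(target) != len(scores):
                raise ValueError("soft label must match cls_score in shape")
            losses.append(-sum(t * lp for t, lp in zip(target, log_probs)))
    return sum(losses) / len(losses)


cls_score = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
print(cross_entropy(cls_score, [0, 1]))
print(cross_entropy(cls_score, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
```

A one-hot soft label gives the same loss as the corresponding hard label, which is a quick sanity check for either code path.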

class mmaction.models.losses.HVULoss(categories=('action', 'attribute', 'concept', 'event', 'object', 'scene'), category_nums=(739, 117, 291, 69, 1678, 248), category_loss_weights=(1, 1, 1, 1, 1, 1), loss_type='all', with_mask=False, reduction='mean', loss_weight=1.0)[source]¶

Calculate the BCELoss for HVU.

Parameters

categories (tuple[str]) – Names of tag categories, tags are organized in this order. Default: (‘action’, ‘attribute’, ‘concept’, ‘event’, ‘object’, ‘scene’).
category_nums (tuple[int]) – Number of tags for each category. Default: (739, 117, 291, 69, 1678, 248).
category_loss_weights (tuple[float]) – Loss weights of categories, it applies only if loss_type == ‘individual’. The loss weights will be normalized so that the sum equals to 1, so that you can give any positive number as loss weight. Default: (1, 1, 1, 1, 1, 1).
loss_type (str) – The loss type we calculate, we can either calculate the BCELoss for all tags, or calculate the BCELoss for tags in each category. Choices are ‘individual’ or ‘all’. Default: ‘all’.
with_mask (bool) – Some tag categories are missing for some video clips. If with_mask == True, no loss is calculated for these missing categories. Otherwise, the missing categories are treated as negative samples. Default: False.
reduction (str) – Reduction way. Choices are ‘mean’ or ‘sum’. Default: ‘mean’.
loss_weight (float) – The loss weight. Default: 1.0.
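
Two details from the parameter list can be made concrete: tags are laid out category by category in the order given by categories, and category_loss_weights are normalized to sum to 1. The helper names below are hypothetical and the code only illustrates that bookkeeping. It does not compute any loss.

```python
def category_slices(categories, category_nums):
    """Map each category name to its (start, end) range in the tag vector."""
    slices = {}
    start = 0
    for name, num in zip(categories, category_nums):
        slices[name] = (start, start + num)
        start += num
    return slices


def normalize_category_weights(categories, category_loss_weights):
    """Scale the per-category weights so that they sum to 1."""
    if len(categories) != len(category_loss_weights):
        raise ValueError("one weight is required per category")
    total = sum(category_loss_weights)
    return {name: w / total for name, w in zip(categories, category_loss_weights)}


categories = ("action", "attribute", "concept", "event", "object", "scene")
category_nums = (739, 117, 291, 69, 1678, 248)
print(category_slices(categories, category_nums))
print(normalize_category_weights(("action", "object", "scene"), (1, 1, 2)))
```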

class mmaction.models.losses.NLLLoss(loss_weight=1.0)[source]¶

NLL Loss.

It will calculate NLL loss given cls_score and label.

class mmaction.models.losses.OHEMHingeLoss[source]¶

This class is the core implementation for the completeness loss in paper.

It computes class-wise hinge loss and performs online hard example mining (OHEM).

static backward(ctx, grad_output)[source]¶

Defines a formula for differentiating the operation.

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs as forward() returned, and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx, pred, labels, is_positive, ohem_ratio, group_size)[source]¶

Calculate OHEM hinge loss.

Parameters

pred (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
is_positive (int) – Set to 1 when proposals are positive and set to -1 when proposals are incomplete.
ohem_ratio (float) – Ratio of hard examples.
group_size (int) – Number of proposals sampled per video.

Returns

Returned class-wise hinge loss.

Return type

torch.Tensor
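
A rough sketch of class-wise hinge loss with online hard example mining follows. It assumes labels are 0-based column indices into pred, that proposals are grouped consecutively with group_size per video, and that the hardest int(group_size * ohem_ratio) losses in each group are summed. These conventions are assumptions made for illustration and the actual implementation may index or reduce differently.

```python
def ohem_hinge_loss(pred, labels, is_positive, ohem_ratio, group_size):
    """Sum the hardest hinge losses within each group of proposals.

    pred: list of rows of per-class completeness scores.
    labels: list of 0-based class indices (an assumption of this sketch).
    is_positive: 1 for positive proposals, -1 for incomplete ones.
    """
    # Hinge loss on the score of the groundtruth class of each proposal.
    losses = [max(0.0, 1.0 - is_positive * row[label])
              for row, label in zip(pred, labels)]
    keep = int(group_size * ohem_ratio)
    total = 0.0
    for start in range(0, len(losses), group_size):
        # Keep only the largest losses of this group (hard examples).
        hardest = sorted(losses[start:start + group_size], reverse=True)
        total += sum(hardest[:keep])
    return total


pred = [[0.2], [0.9], [-0.5], [1.5]]
print(ohem_hinge_loss(pred, [0, 0, 0, 0], is_positive=1,
                      ohem_ratio=0.5, group_size=2))
```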

class mmaction.models.losses.SSNLoss[source]¶

static activity_loss(activity_score, labels, activity_indexer)[source]¶

Activity Loss.

It will calculate activity loss given activity_score and label.

Parameters

activity_score (torch.Tensor) – Predicted activity score.
labels (torch.Tensor) – Groundtruth class label.
activity_indexer (torch.Tensor) – Index slices of proposals.

Returns: Returned cross entropy loss.
Return type: torch.Tensor

static classwise_regression_loss(bbox_pred, labels, bbox_targets, regression_indexer)[source]¶

Classwise Regression Loss.

It will calculate classwise_regression loss given class_reg_pred and targets.

Parameters

bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
regression_indexer (torch.Tensor) – Index slices of positive proposals.

Returns: Returned class-wise regression loss.
Return type: torch.Tensor

static completeness_loss(completeness_score, labels, completeness_indexer, positive_per_video, incomplete_per_video, ohem_ratio=0.17)[source]¶

Completeness Loss.

It will calculate completeness loss given completeness_score and label.

Parameters

completeness_score (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
completeness_indexer (torch.Tensor) – Index slices of positive and incomplete proposals.
positive_per_video (int) – Number of positive proposals sampled per video.
incomplete_per_video (int) – Number of incomplete proposals sampled per video.
ohem_ratio (float) – Ratio of online hard example mining. Default: 0.17.

Returns: Returned class-wise completeness loss.
Return type: torch.Tensor

forward(activity_score, completeness_score, bbox_pred, proposal_type, labels, bbox_targets, train_cfg)[source]¶

Calculate Structured Segment Network (SSN) Loss.

Parameters

activity_score (torch.Tensor) – Predicted activity score.
completeness_score (torch.Tensor) – Predicted completeness score.
bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
proposal_type (torch.Tensor) – Type index slices of proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
train_cfg (dict) – Config for training.

Returns

(loss_activity, loss_completeness, loss_reg). Loss_activity is the activity loss, loss_completeness is the class-wise completeness loss, loss_reg is the class-wise regression loss.

Return type

dict([torch.Tensor, torch.Tensor, torch.Tensor])

mmaction.datasets¶

datasets¶

class mmaction.datasets.AVADataset(ann_file, exclude_file, pipeline, label_file=None, filename_tmpl='img_{:05}.jpg', proposal_file=None, person_det_score_thr=0.9, num_classes=81, custom_classes=None, data_prefix=None, test_mode=False, modality='RGB', num_max_proposals=1000, timestamp_start=900, timestamp_end=1800)[source]¶

AVA dataset for spatial temporal detection.

Based on official AVA annotation files, the dataset loads raw frames, bounding boxes, proposals and applies specified transformations to return a dict containing the frame tensors and other information.

This dataset can load information from the following files:

ann_file -> ava_{train, val}_{v2.1, v2.2}.csv
exclude_file -> ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv
label_file -> ava_action_list_{v2.1, v2.2}.pbtxt /
              ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt
proposal_file -> ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl

Particularly, the proposal_file is a pickle file which contains img_key (in format of {video_id},{timestamp}). Example of a pickle file:

{
    ...
    '0f39OWEqJ24,0902':
        array([[0.011   , 0.157   , 0.655   , 0.983   , 0.998163]]),
    '0f39OWEqJ24,0912':
        array([[0.054   , 0.088   , 0.91    , 0.998   , 0.068273],
               [0.016   , 0.161   , 0.519   , 0.974   , 0.984025],
               [0.493   , 0.283   , 0.981   , 0.984   , 0.983621]]),
    ...
}

Parameters

ann_file (str) – Path to the annotation file like ava_{train, val}_{v2.1, v2.2}.csv.
exclude_file (str) – Path to the excluded timestamp file like ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv.
pipeline (list[dict | callable]) – A sequence of data transforms.
label_file (str) – Path to the label file like ava_action_list_{v2.1, v2.2}.pbtxt or ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt. Default: None.
filename_tmpl (str) – Template for each filename. Default: ‘img_{:05}.jpg’.
proposal_file (str) – Path to the proposal file like ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl. Default: None.
person_det_score_thr (float) – The threshold of person detection scores, bboxes with scores above the threshold will be used. Default: 0.9. Note that 0 <= person_det_score_thr <= 1. If no proposal has detection score larger than the threshold, the one with the largest detection score will be used.
num_classes (int) – The number of classes of the dataset. Default: 81. (AVA has 80 action classes, another 1-dim is added for potential usage)
custom_classes (list[int]) – A subset of class ids from origin dataset. Please note that 0 should NOT be selected, and num_classes should be equal to len(custom_classes) + 1
data_prefix (str) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.
num_max_proposals (int) – Max proposals number to store. Default: 1000.
timestamp_start (int) – The start point of included timestamps. The default value is referred from the official website. Default: 900.
timestamp_end (int) – The end point of included timestamps. The default value is referred from the official website. Default: 1800.
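
The proposal handling described for person_det_score_thr and the img_key format can be sketched as below. The function names are hypothetical, proposals are plain lists of [x1, y1, x2, y2, score], and keeping a proposal whose score equals the threshold is an assumption of this sketch.

```python
def split_img_key(img_key):
    """Split '{video_id},{timestamp}' into (video_id, int timestamp)."""
    video_id, timestamp = img_key.rsplit(",", 1)
    return video_id, int(timestamp)


def select_proposals(proposals, person_det_score_thr=0.9):
    """Keep proposals scoring at least the threshold.

    If none qualifies, fall back to the single highest-scoring proposal.
    """
    if not proposals:
        return []
    kept = [p for p in proposals if p[4] >= person_det_score_thr]
    if not kept:
        kept = [max(proposals, key=lambda p: p[4])]
    return kept


proposals = [
    [0.054, 0.088, 0.91, 0.998, 0.068273],
    [0.016, 0.161, 0.519, 0.974, 0.984025],
    [0.493, 0.283, 0.981, 0.984, 0.983621],
]
print(split_img_key("0f39OWEqJ24,0912"))
print(len(select_proposals(proposals, 0.9)))
print(select_proposals(proposals, 0.99))
```

The last call shows the fallback: no score reaches 0.99, so only the best proposal is returned.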

dump_results(results, out)[source]¶: Dump data to json/yaml/pickle strings or files.

evaluate(results, metrics=('mAP'), metric_options=None, logger=None)[source]¶

Perform evaluation for common datasets.

Parameters

results (list) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Default: ‘mAP’.
metric_options (dict | None) – Dict for metric options. Default: None.
logger (logging.Logger | None) – Logger for recording. Default: None.
deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.

Returns

Evaluation results dict.

Return type

dict

load_annotations()[source]¶: Load the annotation according to ann_file into video_infos.

prepare_test_frames(idx)[source]¶: Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]¶: Prepare the frames for training given the index.

class mmaction.datasets.ActivityNetDataset(ann_file, pipeline, data_prefix=None, test_mode=False)[source]¶

ActivityNet dataset for temporal action localization.

The dataset loads raw features and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a json file with multiple objects, and each object has a key of the name of a video, and value of total frames of the video, total seconds of the video, annotations of a video, feature frames (frames covered by features) of the video, fps and rfps. Example of an annotation file:

{
    "v_--1DO2V4K74":  {
        "duration_second": 211.53,
        "duration_frame": 6337,
        "annotations": [
            {
                "segment": [
                    30.025882995319815,
                    205.2318595943838
                ],
                "label": "Rock climbing"
            }
        ],
        "feature_frame": 6336,
        "fps": 30.0,
        "rfps": 29.9579255898
    },
    "v_--6bJUbfpnQ": {
        "duration_second": 26.75,
        "duration_frame": 647,
        "annotations": [
            {
                "segment": [
                    2.578755070202808,
                    24.914101404056165
                ],
                "label": "Drinking beer"
            }
        ],
        "feature_frame": 624,
        "fps": 24.0,
        "rfps": 24.1869158879
    },
    ...
}

Parameters

ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
data_prefix (str | None) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
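
Reading an annotation file of the shape shown above needs only the standard json module. The sketch below flattens it into (video name, label, start, end) tuples. The function name list_segments is hypothetical and the sample string reuses one entry from the example, so this only illustrates the file layout rather than what load_annotations does internally.

```python
import json


def list_segments(ann_json):
    """Return (video_name, label, start_second, end_second) tuples."""
    annotations = json.loads(ann_json)
    segments = []
    for video_name, info in annotations.items():
        for ann in info["annotations"]:
            start, end = ann["segment"]
            segments.append((video_name, ann["label"], start, end))
    return segments


sample = """
{
    "v_--6bJUbfpnQ": {
        "duration_second": 26.75,
        "duration_frame": 647,
        "annotations": [
            {"segment": [2.578755070202808, 24.914101404056165],
             "label": "Drinking beer"}
        ],
        "feature_frame": 624,
        "fps": 24.0,
        "rfps": 24.1869158879
    }
}
"""
print(list_segments(sample))
```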

dump_results(results, out, output_format, version='VERSION 1.3')[source]¶: Dump data to json/csv files.

evaluate(results, metrics='AR@AN', metric_options={'AR@AN': {'max_avg_proposals': 100, 'temporal_iou_thresholds': array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95])}}, logger=None, **deprecated_kwargs)[source]¶

Evaluation in feature dataset.

Parameters

results (list[dict]) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘AR@AN’.
metric_options (dict) – Dict for metric options. Options are max_avg_proposals, temporal_iou_thresholds for AR@AN. default: {'AR@AN': dict(max_avg_proposals=100, temporal_iou_thresholds=np.linspace(0.5, 0.95, 10))}.
logger (logging.Logger | None) – Training logger. Defaults: None.
deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.

Returns

Evaluation results for evaluation metrics.

Return type

dict

load_annotations()[source]¶: Load the annotation according to ann_file into video_infos.

prepare_test_frames(idx)[source]¶: Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]¶: Prepare the frames for training given the index.

static proposals2json(results, show_progress=False)[source]¶

Convert all proposals to a final dict(json) format.

Parameters

results (list[dict]) – All proposals.
show_progress (bool) – Whether to show the progress bar. Defaults: False.

Returns

The final result dict. E.g.

dict(video-1=[dict(segment=[1.1, 2.0], score=0.9),
              dict(segment=[50.1, 129.3], score=0.6)])

Return type

dict

class mmaction.datasets.AudioDataset(ann_file, pipeline, suffix='.wav', **kwargs)[source]¶

Audio dataset for video recognition. Extracts the audio feature on-the-fly. Annotation file can be that of the rawframe dataset, or:

some/directory-1.wav 163 1
some/directory-2.wav 122 1
some/directory-3.wav 258 2
some/directory-4.wav 234 2
some/directory-5.wav 295 3
some/directory-6.wav 121 3

Parameters

ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
suffix (str) – The suffix of the audio file. Default: ‘.wav’.
kwargs (dict) – Other keyword args for BaseDataset.

load_annotations()[source]¶: Load annotation file to get video information.

class mmaction.datasets.AudioFeatureDataset(ann_file, pipeline, suffix='.npy', **kwargs)[source]¶

Audio feature dataset for video recognition. Reads the features extracted off-line. Annotation file can be that of the rawframe dataset, or:

some/directory-1.npy 163 1
some/directory-2.npy 122 1
some/directory-3.npy 258 2
some/directory-4.npy 234 2
some/directory-5.npy 295 3
some/directory-6.npy 121 3

Parameters

ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
suffix (str) – The suffix of the audio feature file. Default: ‘.npy’.
kwargs (dict) – Other keyword args for BaseDataset.

load_annotations()[source]¶: Load annotation file to get video information.

class mmaction.datasets.AudioVisualDataset(ann_file, pipeline, audio_prefix, **kwargs)[source]¶

Dataset that reads both audio and visual data, supporting both rawframes and videos. The annotation file is same as that of the rawframe dataset, such as:

some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3

Parameters

ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
audio_prefix (str) – Directory of the audio files.
kwargs (dict) – Other keyword args for RawframeDataset. video_prefix is also allowed if pipeline is designed for videos.

load_annotations()[source]¶: Load annotation file to get video information.

class mmaction.datasets.BaseDataset(ann_file, pipeline, data_prefix=None, test_mode=False, multi_class=False, num_classes=None, start_index=1, modality='RGB', sample_by_class=False, power=None)[source]¶

Base class for datasets.

All datasets to process video should subclass it. All subclasses should override:

- Methods:load_annotations, which loads information from an annotation file.
- Methods:prepare_train_frames, providing train data.
- Methods:prepare_test_frames, providing test data.

Parameters

ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
data_prefix (str | None) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
multi_class (bool) – Determines whether the dataset is a multi-class dataset. Default: False.
num_classes (int | None) – Number of classes of the dataset, used in multi-class datasets. Default: None.
start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Default: 1.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’, ‘Audio’. Default: ‘RGB’.
sample_by_class (bool) – Sampling by class, should be set True when performing inter-class data balancing. Only compatible with multi_class == False. Only applies for training. Default: False.
power (float | None) – We support sampling data with the probability proportional to the power of its label frequency (freq ^ power) when sampling data. power == 1 indicates uniformly sampling all data; power == 0 indicates uniformly sampling all classes. Default: None.
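
The effect of power on class-level sampling can be sketched as follows, assuming a class is first drawn with probability proportional to freq ^ power and a sample is then drawn uniformly within that class. The function name class_probabilities is hypothetical and only the first step is shown.

```python
from collections import Counter


def class_probabilities(labels, power):
    """Probability of drawing each class, proportional to freq ** power."""
    counts = Counter(labels)
    weights = {cls: count ** power for cls, count in counts.items()}
    total = sum(weights.values())
    return {cls: weight / total for cls, weight in weights.items()}


labels = [0, 0, 0, 1]
print(class_probabilities(labels, power=1))  # follows the label frequency
print(class_probabilities(labels, power=0))  # every class equally likely
```

With power == 1 a class is drawn as often as it occurs, which is equivalent to sampling all data uniformly. With power == 0 rare and frequent classes are drawn equally often.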

static dump_results(results, out)[source]¶: Dump data to json/yaml/pickle strings or files.

evaluate(results, metrics='top_k_accuracy', metric_options={'top_k_accuracy': {'topk': (1, 5)}}, logger=None, **deprecated_kwargs)[source]¶

Perform evaluation for common datasets.

Parameters

results (list) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘top_k_accuracy’.
metric_options (dict) – Dict for metric options. Options are topk for top_k_accuracy. Default: dict(top_k_accuracy=dict(topk=(1, 5))).
logger (logging.Logger | None) – Logger for recording. Default: None.
deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.

Returns

Evaluation results dict.

Return type

dict
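
For reference, top-k accuracy with the default topk=(1, 5) option can be computed as in the sketch below. It works on plain lists of per-class scores. The function name and return format (a list with one accuracy per k) are assumptions of this sketch and not the format returned by evaluate.

```python
def top_k_accuracy(scores, labels, topk=(1, 5)):
    """Fraction of samples whose label is among the k highest scores."""
    results = []
    for k in topk:
        hits = 0
        for row, label in zip(scores, labels):
            # Class indices sorted by descending score.
            ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
            if label in ranked[:k]:
                hits += 1
        results.append(hits / len(labels))
    return results


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, topk=(1, 2)))
```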

abstract load_annotations()[source]¶: Load the annotation according to ann_file into video_infos.

load_json_annotations()[source]¶: Load json annotation file to get video information.

prepare_test_frames(idx)[source]¶: Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]¶: Prepare the frames for training given the index.

class mmaction.datasets.BaseMiniBatchBlending(num_classes)[source]¶: Base class for Image Aliasing.

class mmaction.datasets.CutmixBlending(num_classes, alpha=0.2)[source]¶

Implementing Cutmix in a mini-batch.

This module is proposed in CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. Code Reference https://github.com/clovaai/CutMix-PyTorch

Parameters

num_classes (int) – The number of classes.
alpha (float) – Parameters for Beta distribution.

do_blending(imgs, label, **kwargs)[source]¶: Blending images with cutmix.

static rand_bbox(img_size, lam)[source]¶: Generate a random bounding box.
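
The box sampling used by CutMix can be sketched following the scheme of the referenced CutMix-PyTorch code: the box covers roughly a (1 - lam) fraction of the image, is centred at a random pixel and is clipped to the image borders. The names sample_cut_box and adjusted_lambda are hypothetical, and the argument layout differs from rand_bbox(img_size, lam).

```python
import math
import random


def sample_cut_box(height, width, lam, rng=random):
    """Sample (x1, y1, x2, y2) covering about (1 - lam) of the image area."""
    cut_ratio = math.sqrt(1.0 - lam)
    cut_h = int(height * cut_ratio)
    cut_w = int(width * cut_ratio)
    # Random box centre, then clip the box to the image borders.
    cy = rng.randrange(height)
    cx = rng.randrange(width)
    y1 = max(cy - cut_h // 2, 0)
    y2 = min(cy + cut_h // 2, height)
    x1 = max(cx - cut_w // 2, 0)
    x2 = min(cx + cut_w // 2, width)
    return x1, y1, x2, y2


def adjusted_lambda(box, height, width):
    """Recompute lam from the clipped box so labels match the pasted area."""
    x1, y1, x2, y2 = box
    return 1.0 - (x2 - x1) * (y2 - y1) / (height * width)


rng = random.Random(0)
box = sample_cut_box(224, 224, lam=0.7, rng=rng)
print(box, adjusted_lambda(box, 224, 224))
```

Because clipping can shrink the box, the label mixing weight is usually recomputed from the final box area instead of reusing the sampled lam.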

class mmaction.datasets.HVUDataset(ann_file, pipeline, tag_categories, tag_category_nums, filename_tmpl=None, **kwargs)[source]¶

HVU dataset, which supports the recognition tags of multiple categories. Accepts either video annotation files or rawframe annotation files.

The dataset loads videos or raw frames and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a json file with multiple dictionaries, and each dictionary indicates a sample video with the filename and tags, the tags are organized as different categories. Example of a video dictionary:

{
    'filename': 'gD_G1b0wV5I_001015_001035.mp4',
    'label': {
        'concept': [250, 131, 42, 51, 57, 155, 122],
        'object': [1570, 508],
        'event': [16],
        'action': [180],
        'scene': [206]
    }
}

Example of a rawframe dictionary:

{
    'frame_dir': 'gD_G1b0wV5I_001015_001035',
    'total_frames': 61,
    'label': {
        'concept': [250, 131, 42, 51, 57, 155, 122],
        'object': [1570, 508],
        'event': [16],
        'action': [180],
        'scene': [206]
    }
}

Parameters

ann_file (str) – Path to the annotation file, should be a json file.
pipeline (list[dict | callable]) – A sequence of data transforms.
tag_categories (list[str]) – List of category names of tags.
tag_category_nums (list[int]) – List of number of tags in each category.
filename_tmpl (str | None) – Template for each filename. If set to None, video dataset is used. Default: None.
**kwargs – Keyword arguments for BaseDataset.

evaluate(results, metrics='mean_average_precision', metric_options=None, logger=None)[source]¶

Evaluation in HVU Video Dataset. We only support evaluating mAP for each tag category. Since some tag categories are missing for some videos, we cannot evaluate mAP for all tags.

Parameters

results (list) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘mean_average_precision’.
metric_options (dict | None) – Dict for metric options. Default: None.
logger (logging.Logger | None) – Logger for recording. Default: None.

Returns

Evaluation results dict.

Return type

dict

load_annotations()[source]¶: Load annotation file to get video information.

load_json_annotations()[source]¶: Load json annotation file to get video information.

class mmaction.datasets.ImageDataset(ann_file, pipeline, **kwargs)[source]¶

Image dataset for action recognition, used in the Project OmniSource.

The dataset loads an image list and applies specified transforms to return a dict containing the image tensors and other information.

For the ImageDataset, the ann_file is a text file with multiple lines, and each line indicates the image path and the image label, which are separated by whitespace. Example of an annotation file:

path/to/image1.jpg 1
path/to/image2.jpg 1
path/to/image3.jpg 2
path/to/image4.jpg 2
path/to/image5.jpg 3
path/to/image6.jpg 3

Example of a multi-class annotation file:

path/to/image1.jpg 1 3 5
path/to/image2.jpg 1 2
path/to/image3.jpg 2
path/to/image4.jpg 2 4 6 8
path/to/image5.jpg 3
path/to/image6.jpg 3

Parameters

ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
**kwargs – Keyword arguments for BaseDataset.
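
Both annotation layouts shown above can be parsed with a few lines of plain Python. The function name parse_image_annotations is hypothetical, and the sketch assumes that image paths contain no whitespace.

```python
def parse_image_annotations(text, multi_class=False):
    """Parse 'path label [label ...]' lines into (path, label(s)) tuples."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        path = parts[0]
        labels = [int(p) for p in parts[1:]]
        if multi_class:
            samples.append((path, labels))
        else:
            if len(labels) != 1:
                raise ValueError("expected exactly one label: " + line)
            samples.append((path, labels[0]))
    return samples


single = "path/to/image1.jpg 1\npath/to/image3.jpg 2\n"
multi = "path/to/image1.jpg 1 3 5\npath/to/image2.jpg 1 2\n"
print(parse_image_annotations(single))
print(parse_image_annotations(multi, multi_class=True))
```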

class mmaction.datasets.MixupBlending(num_classes, alpha=0.2)[source]¶

Implementing Mixup in a mini-batch.

This module is proposed in mixup: Beyond Empirical Risk Minimization. Code Reference https://github.com/open-mmlab/mmclassification/blob/master/mmcls/models/utils/mixup.py

Parameters

num_classes (int) – The number of classes.
alpha (float) – Parameters for Beta distribution.

do_blending(imgs, label, **kwargs)[source]¶: Blending images with mixup.
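
The mixup idea can be sketched in plain Python as follows: a mixing coefficient lam is drawn from Beta(alpha, alpha), and each sample and its one-hot label are blended with a randomly chosen partner from the same mini-batch. The function name mixup and the flat-list data layout are assumptions of this sketch and do not mirror do_blending.

```python
import random


def mixup(imgs, one_hot_labels, alpha=0.2, rng=random):
    """Blend each sample with a shuffled partner using one shared lam."""
    lam = rng.betavariate(alpha, alpha)
    order = list(range(len(imgs)))
    rng.shuffle(order)  # order[i] is the partner of sample i
    mixed_imgs = [
        [lam * a + (1.0 - lam) * b for a, b in zip(imgs[i], imgs[j])]
        for i, j in enumerate(order)
    ]
    mixed_labels = [
        [lam * a + (1.0 - lam) * b
         for a, b in zip(one_hot_labels[i], one_hot_labels[j])]
        for i, j in enumerate(order)
    ]
    return mixed_imgs, mixed_labels, lam


rng = random.Random(0)
imgs = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mixed_imgs, mixed_labels, lam = mixup(imgs, labels, alpha=0.2, rng=rng)
print(lam, mixed_labels)
```

Each mixed label still sums to 1, so it can be used directly as a soft label.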

class mmaction.datasets.RawVideoDataset(ann_file, pipeline, clipname_tmpl='part_{}.mp4', sampling_strategy='positive', **kwargs)[source]¶

RawVideo dataset for action recognition, used in the Project OmniSource.

The dataset loads clips of raw videos and applies specified transforms to return a dict containing the frame tensors and other information. Note that for this dataset, multi_class should be False.

The ann_file is a text file with multiple lines, and each line indicates a sample video with the filepath (without suffix), label, number of clips and index of positive clips (starting from 0), which are split with a whitespace. Raw videos should be first trimmed into 10 second clips, organized in the following format:

some/path/D32_1gwq35E/part_0.mp4
some/path/D32_1gwq35E/part_1.mp4
......
some/path/D32_1gwq35E/part_n.mp4

Example of an annotation file:

some/path/D32_1gwq35E 66 10 0 1 2
some/path/-G-5CJ0JkKY 254 5 3 4
some/path/T4h1bvOd9DA 33 1 0
some/path/4uZ27ivBl00 341 2 0 1
some/path/0LfESFkfBSw 186 234 7 9 11
some/path/-YIsNpBEx6c 169 100 9 10 11

The first line indicates that the raw video some/path/D32_1gwq35E has action label 66, consists of 10 clips (from part_0.mp4 to part_9.mp4). The 1st, 2nd and 3rd clips are positive clips.

Parameters

ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
sampling_strategy (str) – The strategy to sample clips from raw videos. Choices are ‘random’ or ‘positive’. Default: ‘positive’.
clipname_tmpl (str) – The template of clip name in the raw video. Default: ‘part_{}.mp4’.
**kwargs – Keyword arguments for BaseDataset.
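
The annotation line layout and the two sampling strategies can be sketched as follows. This is an illustration under assumptions, not the dataset's actual code: the helper names `parse_raw_video_line` and `sample_clip_path` and the dict keys are hypothetical.

```python
import random


def parse_raw_video_line(line):
    """Split 'path label num_clips pos_idx [pos_idx ...]' into fields."""
    parts = line.split()
    return {
        'video_dir': parts[0],
        'label': int(parts[1]),
        'num_clips': int(parts[2]),
        'positive_clip_inds': [int(x) for x in parts[3:]],
    }


def sample_clip_path(info, sampling_strategy='positive',
                     clipname_tmpl='part_{}.mp4', rng=random):
    """Pick one clip index and format the path of that clip."""
    if sampling_strategy == 'positive':
        ind = rng.choice(info['positive_clip_inds'])
    elif sampling_strategy == 'random':
        ind = rng.randrange(info['num_clips'])
    else:
        raise ValueError(f'unknown strategy: {sampling_strategy}')
    return f"{info['video_dir']}/{clipname_tmpl.format(ind)}"


info = parse_raw_video_line('some/path/D32_1gwq35E 66 10 0 1 2')
clip = sample_clip_path(info, rng=random.Random(0))
```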

load_annotations()[source]¶: Load annotation file to get video information.

load_json_annotations()[source]¶: Load json annotation file to get video information.

prepare_test_frames(idx)[source]¶: Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]¶: Prepare the frames for training given the index.

sample_clip(results)[source]¶: Sample a clip from the raw video given the sampling strategy.

class mmaction.datasets.RawframeDataset(ann_file, pipeline, data_prefix=None, test_mode=False, filename_tmpl='img_{:05}.jpg', with_offset=False, multi_class=False, num_classes=None, start_index=1, modality='RGB', sample_by_class=False, power=None)[source]¶

Rawframe dataset for action recognition.

The dataset loads raw frames and applies the specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines. Each line gives the directory holding the frames of a video, the total number of frames in the video and the label of the video, separated by whitespace. Example of an annotation file:

some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3

Example of a multi-class annotation file:

some/directory-1 163 1 3 5
some/directory-2 122 1 2
some/directory-3 258 2
some/directory-4 234 2 4 6 8
some/directory-5 295 3
some/directory-6 121 3

Example of a with_offset annotation file (clips from long videos), each line indicates the directory to frames of a video, the index of the start frame, total frames of the video clip and the label of a video clip, which are split with a whitespace.

some/directory-1 12 163 3
some/directory-2 213 122 4
some/directory-3 100 258 5
some/directory-4 98 234 2
some/directory-5 0 295 3
some/directory-6 50 121 3

Parameters

ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
data_prefix (str | None) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
filename_tmpl (str) – Template for each filename. Default: ‘img_{:05}.jpg’.
with_offset (bool) – Determines whether the offset information is in ann_file. Default: False.
multi_class (bool) – Determines whether it is a multi-class recognition dataset. Default: False.
num_classes (int | None) – Number of classes in the dataset. Default: None.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.
sample_by_class (bool) – Sampling by class, should be set True when performing inter-class data balancing. Only compatible with multi_class == False. Only applies for training. Default: False.
power (float | None) – We support sampling data with the probability proportional to the power of its label frequency (freq ^ power) when sampling data. power == 1 indicates uniformly sampling all data; power == 0 indicates uniformly sampling all classes. Default: None.
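
The effect of power can be checked with a small calculation. The sketch below assumes that a class is drawn with probability proportional to freq ^ power before a sample is picked within that class; the function name `class_sampling_probs` is hypothetical and the code is not taken from the library.

```python
from collections import Counter


def class_sampling_probs(labels, power):
    """Return {class: probability}, proportional to freq ** power."""
    freq = Counter(labels)
    weights = {cls: count ** power for cls, count in freq.items()}
    total = sum(weights.values())
    return {cls: w / total for cls, w in weights.items()}


labels = [1, 1, 1, 2]  # class 1 is three times as frequent as class 2
uniform_over_classes = class_sampling_probs(labels, power=0)
uniform_over_data = class_sampling_probs(labels, power=1)
```

With power=0 both classes are equally likely. With power=1 each class is drawn in proportion to its size, which matches sampling uniformly over all data.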

load_annotations()[source]¶: Load annotation file to get video information.

prepare_test_frames(idx)[source]¶: Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]¶: Prepare the frames for training given the index.

class mmaction.datasets.RepeatDataset(dataset, times)[source]¶

A wrapper of repeated dataset.

The length of the repeated dataset is the length of the original dataset multiplied by times. This is useful when the data loading time is long but the dataset is small. Using RepeatDataset can reduce the data loading time between epochs.

Parameters

dataset (Dataset) – The dataset to be repeated.
times (int) – Repeat times.
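
The wrapping idea can be sketched with modular indexing. The class below is a hypothetical stand-in named `RepeatedView`, written for illustration only; it is not the RepeatDataset source.

```python
class RepeatedView:
    """Expose a dataset as if it were `times` copies placed end to end."""

    def __init__(self, dataset, times):
        self.dataset = dataset
        self.times = times
        self._ori_len = len(dataset)

    def __getitem__(self, idx):
        # Wrap the index so every pass reuses the same underlying samples.
        return self.dataset[idx % self._ori_len]

    def __len__(self):
        return self.times * self._ori_len


repeated = RepeatedView(['a', 'b', 'c'], times=4)
```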

class mmaction.datasets.SSNDataset(ann_file, pipeline, train_cfg, test_cfg, data_prefix, test_mode=False, filename_tmpl='img_{:05d}.jpg', start_index=1, modality='RGB', video_centric=True, reg_normalize_constants=None, body_segments=5, aug_segments=(2, 2), aug_ratio=(0.5, 0.5), clip_len=1, frame_interval=1, filter_gt=True, use_regression=True, verbose=False)[source]¶

Proposal frame dataset for Structured Segment Networks.

Based on proposal information, the dataset loads raw frames and applies specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines and each video’s information takes up several lines. This file can be a normalized file with percent or standard file with specific frame indexes. If the file is a normalized file, it will be converted into a standard file first.

Template information of a video in a standard file:

# index
video_id
num_frames
fps
num_gts
label, start_frame, end_frame
label, start_frame, end_frame
…
num_proposals
label, best_iou, overlap_self, start_frame, end_frame
label, best_iou, overlap_self, start_frame, end_frame
…

Example of a standard annotation file:

# 0
video_validation_0000202
5666
1
3
8 130 185
8 832 1136
8 1303 1381
5
8 0.0620 0.0620 790 5671
8 0.1656 0.1656 790 2619
8 0.0833 0.0833 3945 5671
8 0.0960 0.0960 4173 5671
8 0.0614 0.0614 3327 5671
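
As a rough sketch of how one record in a standard file could be read, the following parser assumes that each field of the template sits on its own line (index, video_id, num_frames, fps, num_gts, then the ground truth rows, num_proposals, then the proposal rows). The function name `parse_ssn_record`, the dict keys and the choice to read fps as a float are assumptions, not the library's loader.

```python
def _fields(line):
    # The template separates fields with commas, the example with spaces.
    return line.replace(',', ' ').split()


def parse_ssn_record(lines):
    """Parse one video's block from a standard SSN annotation file."""
    it = iter(ln.strip() for ln in lines if ln.strip())
    record = {'index': int(next(it).lstrip('#'))}
    record['video_id'] = next(it)
    record['num_frames'] = int(next(it))
    record['fps'] = float(next(it))

    gts = []
    for _ in range(int(next(it))):        # num_gts rows follow
        label, start, end = _fields(next(it))
        gts.append((int(label), int(start), int(end)))

    proposals = []
    for _ in range(int(next(it))):        # num_proposals rows follow
        label, best_iou, overlap_self, start, end = _fields(next(it))
        proposals.append((int(label), float(best_iou),
                          float(overlap_self), int(start), int(end)))

    record['gts'] = gts
    record['proposals'] = proposals
    return record


record = parse_ssn_record("""
# 0
video_validation_0000202
5666
1
3
8 130 185
8 832 1136
8 1303 1381
5
8 0.0620 0.0620 790 5671
8 0.1656 0.1656 790 2619
8 0.0833 0.0833 3945 5671
8 0.0960 0.0960 4173 5671
8 0.0614 0.0614 3327 5671
""".splitlines())
```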

Parameters

ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
train_cfg (dict) – Config for training.
test_cfg (dict) – Config for testing.
data_prefix (str) – Path to a directory where videos are held.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
filename_tmpl (str) – Template for each filename. Default: ‘img_{:05d}.jpg’.
start_index (int) – Specify a start index for frames in consideration of different filename format. Default: 1.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.
video_centric (bool) – Whether to sample proposals just from this video or sample proposals randomly from the entire dataset. Default: True.
reg_normalize_constants (list) – Regression target normalized constants, including mean and standard deviation of location and duration.
body_segments (int) – Number of segments in course period. Default: 5.
aug_segments (list[int]) – Number of segments in starting and ending period. Default: (2, 2).
aug_ratio (int | float | tuple[int | float]) – The ratio of the length of augmentation to that of the proposal. Default: (0.5, 0.5).
clip_len (int) – Frames of each sampled output clip. Default: 1.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.
filter_gt (bool) – Whether to filter videos with no annotation during training. Default: True.
use_regression (bool) – Whether to perform regression. Default: True.
verbose (bool) – Whether to print full information or not. Default: False.

construct_proposal_pools()[source]¶: Construct positive proposal pool, incomplete proposal pool and background proposal pool of the entire dataset.

evaluate(results, metrics='mAP', metric_options={'mAP': {'eval_dataset': 'thumos14'}}, logger=None, **deprecated_kwargs)[source]¶

Evaluation in SSN proposal dataset.

Parameters

results (list[dict]) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Default: ‘mAP’.
metric_options (dict) – Dict for metric options. Options are eval_dataset for mAP. Default: dict(mAP=dict(eval_dataset='thumos14')).
logger (logging.Logger | None) – Logger for recording. Default: None.
deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.

Returns

Evaluation results for evaluation metrics.

Return type

dict

get_all_gts()[source]¶: Fetch groundtruth instances of the entire dataset.

static get_negatives(proposals, incomplete_iou_threshold, background_iou_threshold, background_coverage_threshold=0.01, incomplete_overlap_threshold=0.7)[source]¶

Get negative proposals, including incomplete proposals and background proposals.

Parameters

proposals (list) – List of proposal instances(SSNInstance).
incomplete_iou_threshold (float) – Maximum threshold of overlap of incomplete proposals and groundtruths.
background_iou_threshold (float) – Maximum threshold of overlap of background proposals and groundtruths.
background_coverage_threshold (float) – Minimum coverage of background proposals in video duration. Default: 0.01.
incomplete_overlap_threshold (float) – Minimum percent of incomplete proposals’ own span contained in a groundtruth instance. Default: 0.7.

Returns

(incompletes, backgrounds), where incompletes and backgrounds are lists of incomplete proposal instances and background proposal instances, respectively.

Return type

list[SSNInstance]

static get_positives(gts, proposals, positive_threshold, with_gt=True)[source]¶

Get positive/foreground proposals.

Parameters

gts (list) – List of groundtruth instances(SSNInstance).
proposals (list) – List of proposal instances(SSNInstance).
positive_threshold (float) – Minimum threshold of overlap of positive/foreground proposals and groundtruths.
with_gt (bool) – Whether to include groundtruth instances in positive proposals. Default: True.

Returns

(positives), where positives is a list of positive proposal instances.

Return type

list[SSNInstance]
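
The selection of positive proposals can be illustrated with plain (start_frame, end_frame) tuples. This sketch assumes that overlap is measured as temporal IoU and that the threshold comparison is inclusive; both are assumptions, as are the helper names `temporal_iou` and `select_positives`. The real method operates on SSNInstance objects.

```python
def temporal_iou(span_a, span_b):
    """IoU of two (start_frame, end_frame) intervals."""
    inter = max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))
    union = (span_a[1] - span_a[0]) + (span_b[1] - span_b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_positives(gts, proposals, positive_threshold, with_gt=True):
    """Keep proposals whose best IoU with any groundtruth is high enough."""
    positives = [
        p for p in proposals
        if any(temporal_iou(p, g) >= positive_threshold for g in gts)
    ]
    if with_gt:
        positives = positives + list(gts)  # groundtruths count as positives
    return positives


gts = [(130, 185)]
proposals = [(120, 190), (300, 400)]
positives = select_positives(gts, proposals, positive_threshold=0.7)
```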

load_annotations()[source]¶: Load annotation file to get video information.

prepare_test_frames(idx)[source]¶: Prepare the frames for testing given the index.

prepare_train_frames(idx)[source]¶: Prepare the frames for training given the index.

results_to_detections(results, top_k=2000, **kwargs)[source]¶

Convert prediction results into detections.

Parameters

results (list) – Prediction results.
top_k (int) – Number of top results. Default: 2000.

Returns

Detection results.

Return type

list

class mmaction.datasets.VideoDataset(ann_file, pipeline, start_index=0, **kwargs)[source]¶

Video dataset for action recognition.

The dataset loads raw videos and applies the specified transforms to return a dict containing the frame tensors and other information.

The ann_file is a text file with multiple lines. Each line gives a sample video with its filepath and label, separated by whitespace. Example of an annotation file:

some/path/000.mp4 1
some/path/001.mp4 1
some/path/002.mp4 2
some/path/003.mp4 2
some/path/004.mp4 3
some/path/005.mp4 3

Parameters

ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Default: 0.
**kwargs – Keyword arguments for BaseDataset.

load_annotations()[source]¶: Load annotation file to get video information.

mmaction.datasets.build_dataloader(dataset, videos_per_gpu, workers_per_gpu, num_gpus=1, dist=True, shuffle=True, seed=None, drop_last=False, pin_memory=True, **kwargs)[source]¶

Build PyTorch DataLoader.

In distributed training, each GPU/process has a dataloader. In non-distributed training, there is only one dataloader for all GPUs.

Parameters

dataset (Dataset) – A PyTorch dataset.
videos_per_gpu (int) – Number of videos on each GPU, i.e., batch size of each GPU.
workers_per_gpu (int) – How many subprocesses to use for data loading for each GPU.
num_gpus (int) – Number of GPUs. Only used in non-distributed training. Default: 1.
dist (bool) – Distributed training/test or not. Default: True.
shuffle (bool) – Whether to shuffle the data at every epoch. Default: True.
seed (int | None) – Seed to be used. Default: None.
drop_last (bool) – Whether to drop the last incomplete batch in epoch. Default: False.
pin_memory (bool) – Whether to use pin_memory in DataLoader. Default: True.
kwargs (dict, optional) – Any keyword argument to be used to initialize DataLoader.

Returns

A PyTorch dataloader.

Return type

DataLoader
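
The difference between the distributed and non-distributed cases can be sketched as a small helper that derives the batch size and worker count of a single DataLoader. The helper name `loader_sizes` is hypothetical, and the scaling rule for the non-distributed case is an assumption inferred from the description above (one dataloader feeding all GPUs), not a statement of the library's code.

```python
def loader_sizes(videos_per_gpu, workers_per_gpu, num_gpus=1, dist=True):
    """Return (batch_size, num_workers) for one DataLoader."""
    if dist:
        # One loader per process/GPU: the per-GPU values are used directly.
        return videos_per_gpu, workers_per_gpu
    # One loader feeds all GPUs: both values scale with num_gpus.
    return num_gpus * videos_per_gpu, num_gpus * workers_per_gpu


distributed = loader_sizes(8, 2, num_gpus=4, dist=True)
single_process = loader_sizes(8, 2, num_gpus=4, dist=False)
```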

mmaction.datasets.build_dataset(cfg, default_args=None)[source]¶

Build a dataset from config dict.

Parameters

cfg (dict) – Config dict. It should at least contain the key “type”.
default_args (dict | None, optional) – Default initialization arguments. Default: None.

Returns

The constructed dataset.

Return type

Dataset
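
Building an object from a config dict with a “type” key generally follows a registry pattern. The sketch below is a toy version of that pattern: `DATASETS`, `register`, `ToyDataset` and `toy_build` are all hypothetical names, and the rule that values in cfg override default_args is an assumption.

```python
DATASETS = {}  # hypothetical registry: type name -> class


def register(cls):
    """Record a class under its own name so configs can refer to it."""
    DATASETS[cls.__name__] = cls
    return cls


@register
class ToyDataset:
    def __init__(self, ann_file, test_mode=False):
        self.ann_file = ann_file
        self.test_mode = test_mode


def toy_build(cfg, default_args=None):
    args = dict(cfg)                     # copy so the caller's cfg is untouched
    for key, value in (default_args or {}).items():
        args.setdefault(key, value)      # cfg values take precedence
    obj_type = args.pop('type')          # raises KeyError if "type" is missing
    return DATASETS[obj_type](**args)


dataset = toy_build(dict(type='ToyDataset', ann_file='train.txt'),
                    default_args=dict(test_mode=True))
```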

pipelines¶

class mmaction.datasets.pipelines.AudioAmplify(ratio)[source]¶

Amplify the waveform.

Required keys are “audios”, added or modified keys are “audios”, “amplify_ratio”.

Parameters: ratio (float) – The ratio used to amplify the audio waveform.

class mmaction.datasets.pipelines.AudioDecode(fixed_length=32000)[source]¶

Sample the audio w.r.t. the frames selected.

Parameters: fixed_length (int) – As the audio clip selected by frames sampled may not be exactly the same, fixed_length will truncate or pad them into the same size. Default: 32000.

Required keys are “frame_inds”, “num_clips”, “total_frames”, “length”, added or modified keys are “audios”, “audios_shape”.
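
Truncating or padding to fixed_length can be sketched as below. The helper name `fix_length` and the choice of zero padding on the right are assumptions; the transform itself works on arrays rather than Python lists.

```python
def fix_length(samples, fixed_length, pad_value=0.0):
    """Truncate or right-pad a 1-D sequence to exactly fixed_length items."""
    if len(samples) >= fixed_length:
        return list(samples[:fixed_length])
    return list(samples) + [pad_value] * (fixed_length - len(samples))


truncated = fix_length([1, 2, 3, 4, 5], 3)
padded = fix_length([1, 2], 4)
```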

class mmaction.datasets.pipelines.AudioDecodeInit(io_backend='disk', sample_rate=16000, pad_method='zero', **kwargs)[source]¶

Using librosa to initialize the audio reader.

Required keys are “audio_path”, added or modified keys are “length”, “sample_rate”, “audios”.

Parameters

io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
sample_rate (int) – Audio sampling times per second. Default: 16000.

class mmaction.datasets.pipelines.AudioFeatureSelector(fixed_length=128)[source]¶

Sample the audio feature w.r.t. the frames selected.

Required keys are “audios”, “frame_inds”, “num_clips”, “length”, “total_frames”, added or modified keys are “audios”, “audios_shape”.

Parameters: fixed_length (int) – As the features selected by frames sampled may not be exactly the same, fixed_length will truncate or pad them into the same size. Default: 128.

class mmaction.datasets.pipelines.BuildPseudoClip(clip_len)[source]¶

Build pseudo clips with one single image by repeating it n times.

Required key is “imgs”, added or modified keys are “imgs”, “num_clips” and “clip_len”.

Parameters: clip_len (int) – Frames of the generated pseudo clips.

class mmaction.datasets.pipelines.CenterCrop(crop_size, lazy=False)[source]¶

Crop the center area from images.

Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox”, “lazy” and “img_shape”. The required key in “lazy” is “crop_bbox”, added or modified key is “crop_bbox”.

Parameters

crop_size (int | tuple[int]) – (w, h) of crop size.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
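
The crop box of a center crop follows from simple arithmetic. The sketch below uses a hypothetical helper `center_crop_bbox` and returns (left, top, right, bottom); the exact rounding used by the transform is an assumption (integer floor division here).

```python
def center_crop_bbox(img_w, img_h, crop_size):
    """Return (left, top, right, bottom) of a centered crop."""
    if isinstance(crop_size, int):
        crop_w, crop_h = crop_size, crop_size   # square crop
    else:
        crop_w, crop_h = crop_size              # (w, h) as documented
    left = (img_w - crop_w) // 2
    top = (img_h - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


bbox = center_crop_bbox(340, 256, 224)
```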

class mmaction.datasets.pipelines.Collect(keys, meta_keys=('filename', 'label', 'original_shape', 'img_shape', 'pad_shape', 'flip_direction', 'img_norm_cfg'), meta_name='img_metas', nested=False)[source]¶

Collect data from the loader relevant to the specific task.

This keeps the items in keys as they are, and collects the items in meta_keys into a meta item called meta_name. This is usually the last stage of the data loader pipeline. For example, when keys=’imgs’, meta_keys=(‘filename’, ‘label’, ‘original_shape’), meta_name=’img_metas’, the results will be a dict with keys ‘imgs’ and ‘img_metas’, where ‘img_metas’ is a DataContainer of another dict with keys ‘filename’, ‘label’, ‘original_shape’.

Parameters

keys (Sequence[str]) – Required keys to be collected.
meta_name (str) – The name of the key that contains meta information. This key is always populated. Default: “img_metas”.
meta_keys (Sequence[str]) –
Keys that are collected under meta_name. The contents of the meta_name dictionary depend on meta_keys. By default this includes:
- “filename”: path to the image file
- “label”: label of the image file
- “original_shape”: original shape of the image as a tuple (h, w, c)
- “img_shape”: shape of the image input to the network as a tuple (h, w, c). Note that images may be zero padded on the bottom/right, if the batch tensor is larger than this shape.
- “pad_shape”: image shape after padding
- “flip_direction”: a str in (“horizontal”, “vertical”) to indicate if the image is flipped horizontally or vertically.
- “img_norm_cfg”: a dict of normalization information:
  - mean - per channel mean subtraction
  - std - per channel std divisor
  - to_rgb - bool indicating if bgr was converted to rgb
nested (bool) – If set as True, will apply data[x] = [data[x]] to all items in data. The arg is added for compatibility. Default: False.
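
The collecting step can be sketched with plain dicts. This illustration uses a hypothetical function `collect`, takes keys as a list, and stores the meta information in an ordinary dict rather than a DataContainer; it also ignores the nested option.

```python
def collect(results, keys, meta_keys=(), meta_name='img_metas'):
    """Keep `keys` as they are and gather `meta_keys` under `meta_name`."""
    data = {key: results[key] for key in keys}
    data[meta_name] = {key: results[key] for key in meta_keys}
    return data


results = {
    'imgs': [[0, 1], [2, 3]],
    'filename': 'some/path/000.mp4',
    'label': 1,
    'original_shape': (240, 320),
    'scratch': 'dropped because it is not requested',
}
out = collect(results, keys=['imgs'],
              meta_keys=('filename', 'label', 'original_shape'))
```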

class mmaction.datasets.pipelines.ColorJitter(color_space_aug=False, alpha_std=0.1, eig_val=None, eig_vec=None)[source]¶

Randomly distort the brightness, contrast, saturation and hue of images, and add PCA based noise into images.

Note: The input images should be in RGB channel order.

Code Reference: https://gluon-cv.mxnet.io/_modules/gluoncv/data/transforms/experimental/image.html https://mxnet.apache.org/api/python/docs/_modules/mxnet/image/image.html#LightingAug

If color space augmentation is enabled, it will distort the image color space by changing brightness, contrast and saturation. Then, it will add some random distortion to the images in different color channels. Note that the input images should be in original range [0, 255] and in RGB channel sequence.

Required keys are “imgs”, added or modified keys are “imgs”, “eig_val”, “eig_vec”, “alpha_std” and “color_space_aug”.

Parameters

color_space_aug (bool) – Whether to apply color space augmentations. If specified, it will change the brightness, contrast, saturation and hue of images, then add PCA based noise to images. Otherwise, it will directly add PCA based noise to images. Default: False.
alpha_std (float) – Std of the Gaussian distribution from which alpha is drawn. Default: 0.1.
eig_val (np.ndarray | None) – Eigenvalues of [1 x 3] size for RGB channel jitter. If set to None, it will use the default eigenvalues. Default: None.
eig_vec (np.ndarray | None) – Eigenvectors of [3 x 3] size for RGB channel jitter. If set to None, it will use the default eigenvectors. Default: None.

static brightness(img, delta)[source]¶

Brightness distortion.

Parameters

img (np.ndarray) – An input image.
delta (float) – Delta value to distort brightness. It ranges from [-32, 32).

Returns

A brightness distorted image.

Return type

np.ndarray

static contrast(img, alpha)[source]¶

Contrast distortion.

Parameters

img (np.ndarray) – An input image.
alpha (float) – Alpha value to distort contrast. It ranges from [0.6, 1.4).

Returns

A contrast distorted image.

Return type

np.ndarray

static hue(img, alpha)[source]¶

Hue distortion.

Parameters

img (np.ndarray) – An input image.
alpha (float) – Alpha value to control the degree of rotation for hue. It ranges from [-18, 18).

Returns

A hue distorted image.

Return type

np.ndarray

static saturation(img, alpha)[source]¶

Saturation distortion.

Parameters

img (np.ndarray) – An input image.
alpha (float) – Alpha value to distort the saturation. It ranges from [0.6, 1.4).

Returns

A saturation distorted image.

Return type

np.ndarray

class mmaction.datasets.pipelines.Compose(transforms)[source]¶

Compose a data pipeline with a sequence of transforms.

Parameters: transforms (list[dict | callable]) – Either config dicts of transforms or transform objects.
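
Composing transforms amounts to calling them one after another on the same results dict. The sketch below uses a hypothetical class `ComposeSketch` that accepts only callables; building transforms from config dicts, which the real class also supports, is left out.

```python
class ComposeSketch:
    """Apply a sequence of callables to a results dict, in order."""

    def __init__(self, transforms):
        for transform in transforms:
            if not callable(transform):
                raise TypeError(f'transform must be callable: {transform!r}')
        self.transforms = list(transforms)

    def __call__(self, results):
        for transform in self.transforms:
            results = transform(results)
        return results


def add_one(results):
    results['x'] += 1
    return results


def double(results):
    results['x'] *= 2
    return results


out = ComposeSketch([add_one, double])({'x': 1})
```

The order matters: adding one and then doubling gives a different result from doubling and then adding one.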

class mmaction.datasets.pipelines.DecordDecode[source]¶

Using decord to decode the video.

Decord: https://github.com/dmlc/decord

Required keys are “video_reader”, “filename” and “frame_inds”, added or modified keys are “imgs” and “original_shape”.

class mmaction.datasets.pipelines.DecordInit(io_backend='disk', num_threads=1, **kwargs)[source]¶

Using decord to initialize the video_reader.

Decord: https://github.com/dmlc/decord

Required keys are “filename”, added or modified keys are “video_reader” and “total_frames”.

class mmaction.datasets.pipelines.DenseSampleFrames(clip_len, frame_interval=1, num_clips=1, sample_range=64, num_sample_positions=10, temporal_jitter=False, out_of_bound_opt='loop', test_mode=False)[source]¶

Select frames from the video by dense sample strategy.

Required keys are “filename”, added or modified keys are “total_frames”, “frame_inds”, “frame_interval” and “num_clips”.

Parameters

clip_len (int) – Frames of each sampled output clip.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.
num_clips (int) – Number of clips to be sampled. Default: 1.
sample_range (int) – Total sample range for dense sample. Default: 64.
num_sample_positions (int) – Number of sample start positions, which is only used in test mode. Default: 10. That is to say, by default, there are at least 10 clips for one input sample in test mode.
temporal_jitter (bool) – Whether to apply temporal jittering. Default: False.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
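
How sample_range, num_clips and num_sample_positions interact can be sketched as follows. This is a hedged approximation of a dense sampling strategy, not the transform's source: the function name `dense_clip_offsets`, the even spacing of test start positions and the wrap-around at the end of the video are assumptions.

```python
import random


def dense_clip_offsets(num_frames, num_clips, sample_range=64,
                       num_sample_positions=10, test_mode=False, rng=random):
    """Return the start offset of every sampled clip."""
    # Number of start positions for which sample_range fits in the video.
    sample_position = max(1, 1 + num_frames - sample_range)
    interval = sample_range // num_clips
    if test_mode:
        # Evenly spaced start positions across the valid range.
        step = (sample_position - 1) / max(1, num_sample_positions - 1)
        starts = [int(i * step) for i in range(num_sample_positions)]
    else:
        # One random start position during training.
        starts = [rng.randrange(sample_position)]
    return [(start + i * interval) % num_frames
            for start in starts for i in range(num_clips)]


test_offsets = dense_clip_offsets(100, num_clips=4, test_mode=True)
train_offsets = dense_clip_offsets(100, num_clips=4, rng=random.Random(0))
```

In test mode the sketch yields num_sample_positions * num_clips offsets, which is where the figure of at least 10 clips per sample comes from.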

class mmaction.datasets.pipelines.Flip(flip_ratio=0.5, direction='horizontal', flip_label_map=None, lazy=False)[source]¶

Flip the input images with a probability.

Reverse the order of elements in the given imgs with a specific direction. The shape of the imgs is preserved, but the elements are reordered. Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs”, “lazy” and “flip_direction”. There are no required keys in “lazy”; the added or modified keys are “flip” and “flip_direction”. The Flip augmentation should be placed after any cropping / reshaping augmentations, to make sure crop_quadruple is calculated properly.

Parameters

flip_ratio (float) – Probability of implementing flip. Default: 0.5.
direction (str) – Flip imgs horizontally or vertically. Options are “horizontal” | “vertical”. Default: “horizontal”.
flip_label_map (Dict[int, int] | None) – Transform the label of the flipped image with the specific label. Default: None.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
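
The interaction of flip_ratio, direction and flip_label_map can be sketched on a tiny image stored as a list of rows. The function name `flip_sample` and its return value are hypothetical; leaving labels absent from flip_label_map unchanged is an assumption.

```python
import random


def flip_sample(img, label, flip_ratio=0.5, direction='horizontal',
                flip_label_map=None, rng=random):
    """Flip an H x W image (list of rows) with probability flip_ratio."""
    if rng.random() >= flip_ratio:
        return img, label, False               # left untouched
    if direction == 'horizontal':
        flipped = [row[::-1] for row in img]   # reverse the columns
    elif direction == 'vertical':
        flipped = img[::-1]                    # reverse the rows
    else:
        raise ValueError(f'unsupported direction: {direction}')
    if flip_label_map is not None:
        # For example, swap a "left" class with its "right" counterpart.
        label = flip_label_map.get(label, label)
    return flipped, label, True


img = [[1, 2, 3], [4, 5, 6]]
flipped, new_label, did_flip = flip_sample(
    img, label=0, flip_ratio=1.0, flip_label_map={0: 1, 1: 0})
```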

class mmaction.datasets.pipelines.FormatAudioShape(input_format)[source]¶

Format final audio shape to the given input_format.

Required keys are “imgs”, “num_clips” and “clip_len”, added or modified keys are “imgs” and “input_shape”.

Parameters: input_format (str) – Define the final imgs format.

class mmaction.datasets.pipelines.FormatShape(input_format, collapse=False)[source]¶

Format final imgs shape to the given input_format.

Required keys are “imgs”, “num_clips” and “clip_len”, added or modified keys are “imgs” and “input_shape”.

Parameters

input_format (str) – Define the final imgs format.
collapse (bool) – Whether to collapse input_format N… to … (NCTHW to CTHW, etc.) if N is 1. Should be set as True when training and testing detectors. Default: False.

class mmaction.datasets.pipelines.FrameSelector(*args, **kwargs)[source]¶: Deprecated class for RawFrameDecode.

class mmaction.datasets.pipelines.Fuse[source]¶

Fuse lazy operations.

Fusion order: crop -> resize -> flip

Required keys are “imgs”, “img_shape” and “lazy”, added or modified keys are “imgs”, “lazy”. Required keys in “lazy” are “crop_bbox”, “interpolation”, “flip_direction”.

class mmaction.datasets.pipelines.GenerateLocalizationLabels[source]¶

Load video label for localizer with given video_name list.

Required keys are “duration_frame”, “duration_second”, “feature_frame”, “annotations”, added or modified keys are “gt_bbox”.

class mmaction.datasets.pipelines.ImageDecode(io_backend='disk', decoding_backend='cv2', **kwargs)[source]¶

Load and decode images.

Required key is “filename”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

Parameters

io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
decoding_backend (str) – Backend used for image decoding. Default: ‘cv2’.
kwargs (dict, optional) – Arguments for FileClient.

class mmaction.datasets.pipelines.ImageToTensor(keys)[source]¶

Convert image type to torch.Tensor type.

Parameters: keys (Sequence[str]) – Required keys to be converted.

class mmaction.datasets.pipelines.Imgaug(transforms)[source]¶

Imgaug augmentation.

Adds custom transformations from imgaug library. Please visit https://imgaug.readthedocs.io/en/latest/index.html to get more information. Two demo configs could be found in tsn and i3d config folder.

It’s better to use uint8 images as inputs since imgaug works best with numpy dtype uint8 and isn’t well tested with other dtypes. It should be noted that not all of the augmenters have the same input and output dtype, which may cause unexpected results.

Required keys are “imgs”, “img_shape”(if “gt_bboxes” is not None) and “modality”, added or modified keys are “imgs”, “img_shape”, “gt_bboxes” and “proposals”.

It is worth mentioning that Imgaug will NOT create custom keys like “interpolation”, “crop_bbox”, “flip_direction”, etc. So when using Imgaug along with other mmaction2 pipelines, we should pay more attention to required keys.

Two steps to use the Imgaug pipeline:

1. Create the initialization parameter transforms. There are three ways to create transforms.

   1) string: only ‘default’ is supported for now.
      e.g. transforms=’default’
   2) list[dict]: create a list of augmenters from a list of dicts, where each dict corresponds to one augmenter. Every dict MUST contain a key named type. type should be a string (an iaa.Augmenter’s name) or an iaa.Augmenter subclass.
      e.g. transforms=[dict(type=’Rotate’, rotate=(-20, 20))]
      e.g. transforms=[dict(type=iaa.Rotate, rotate=(-20, 20))]
   3) iaa.Augmenter: create an imgaug.Augmenter object.
      e.g. transforms=iaa.Rotate(rotate=(-20, 20))

2. Add Imgaug to the dataset pipeline. It is recommended to insert the imgaug pipeline before Normalize. A demo pipeline is listed as follows.

```
pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=16,
    ),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(
        type='MultiScaleCrop',
        input_size=224,
        scales=(1, 0.875, 0.75, 0.66),
        random_crop=False,
        max_wh_scale_gap=1,
        num_fixed_crops=13),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Imgaug', transforms='default'),
    # dict(type='Imgaug', transforms=[
    #     dict(type='Rotate', rotate=(-20, 20))
    # ]),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
```

Parameters: transforms (str | list[dict] | iaa.Augmenter) – Three different ways to create imgaug augmenter.

default_transforms()[source]¶

Default transforms for imgaug.

Implement RandAugment by imgaug. Please visit https://arxiv.org/abs/1909.13719 for more information.

Augmenters and hyper parameters are borrowed from the following repo: https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py

One augmenter, SolarizeAdd, is missing since imgaug doesn’t support it.

Returns: The constructed RandAugment transforms.
Return type: dict

imgaug_builder(cfg)[source]¶

Import a module from imgaug.

It follows the logic of build_from_cfg(). Use a dict object to create an iaa.Augmenter object.

Parameters: cfg (dict) – Config dict. It should at least contain the key “type”.
Returns: iaa.Augmenter: The constructed imgaug augmenter.
Return type: obj

class mmaction.datasets.pipelines.LoadAudioFeature(pad_method='zero')[source]¶

Load offline extracted audio features.

Required keys are “audio_path”, added or modified keys are “length”, “audios”.

class mmaction.datasets.pipelines.LoadHVULabel(**kwargs)[source]¶

Convert the HVU label from dictionaries to torch tensors.

Required keys are “label”, “categories”, “category_nums”, added or modified keys are “label”, “mask” and “category_mask”.

class mmaction.datasets.pipelines.LoadLocalizationFeature(raw_feature_ext='.csv')[source]¶

Load Video features for localizer with given video_name list.

Required keys are “video_name” and “data_prefix”, added or modified keys are “raw_feature”.

Parameters: raw_feature_ext (str) – Raw feature file extension. Default: ‘.csv’.

class mmaction.datasets.pipelines.LoadProposals(top_k, pgm_proposals_dir, pgm_features_dir, proposal_ext='.csv', feature_ext='.npy')[source]¶

Loading proposals with given proposal results.

Required keys are “video_name”, added or modified keys are ‘bsp_feature’, ‘tmin’, ‘tmax’, ‘tmin_score’, ‘tmax_score’ and ‘reference_temporal_iou’.

Parameters

top_k (int) – The top k proposals to be loaded.
pgm_proposals_dir (str) – Directory to load proposals.
pgm_features_dir (str) – Directory to load proposal features.
proposal_ext (str) – Proposal file extension. Default: ‘.csv’.
feature_ext (str) – Feature file extension. Default: ‘.npy’.

class mmaction.datasets.pipelines.MelSpectrogram(window_size=32, step_size=16, n_mels=80, fixed_length=128)[source]¶

MelSpectrogram. Transfer an audio wave into a mel spectrogram figure.

Required keys are “audios”, “sample_rate”, “num_clips”, added or modified keys are “audios”.

Parameters

window_size (int) – The window size in milliseconds. Default: 32.
step_size (int) – The step size in milliseconds. Default: 16.
n_mels (int) – Number of mels. Default: 80.
fixed_length (int) – The sample length of the mel spectrogram may not be exactly as expected due to different fps, so the length is fixed for batch collation by truncating or padding. Default: 128.
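
Because window_size and step_size are given in milliseconds, they have to be converted to sample counts before a short-time Fourier transform can be computed. The sketch below shows that conversion only; the helper name `stft_sizes` and the names `n_fft` and `hop_length` follow common STFT terminology and are assumptions about how the values would be used.

```python
def stft_sizes(sample_rate, window_size=32, step_size=16):
    """Convert window and step sizes from milliseconds to samples."""
    n_fft = int(round(window_size * sample_rate / 1000))
    hop_length = int(round(step_size * sample_rate / 1000))
    return n_fft, hop_length


sizes_16k = stft_sizes(16000)
```

At a sample rate of 16000 the defaults correspond to a 512-sample window moved in steps of 256 samples.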

class mmaction.datasets.pipelines.MultiGroupCrop(crop_size, groups)[source]¶

Randomly crop the images into several groups.

Crop the random region with the same given crop_size and bounding box into several groups. Required keys are “imgs”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.

Parameters

crop_size (int | tuple[int]) – (w, h) of crop size.
groups (int) – Number of groups.

class mmaction.datasets.pipelines.MultiScaleCrop(input_size, scales=(1), max_wh_scale_gap=1, random_crop=False, num_fixed_crops=5, lazy=False)[source]¶

Crop images with a list of randomly selected scales.

Randomly select the w and h scales from a list of scales. Scale of 1 means the base size, which is the minimum of image width and height. The gap between the scale levels of w and h is limited to prevent an aspect ratio that is too large or too small. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox”, “img_shape”, “lazy” and “scales”. The required key in “lazy” is “crop_bbox”, added or modified key is “crop_bbox”.

Parameters

input_size (int | tuple[int]) – (w, h) of network input.
scales (tuple[float]) – width and height scales to be selected.
max_wh_scale_gap (int) – Maximum gap of w and h scale levels. Default: 1.
random_crop (bool) – If set to True, the cropping bbox will be randomly sampled, otherwise it will be sampled from fixed regions. Default: False.
num_fixed_crops (int) – If set to 5, the cropping bbox will keep 5 basic fixed regions: “upper left”, “upper right”, “lower left”, “lower right”, “center”. If set to 13, the cropping bbox will append another 8 fixed regions: “center left”, “center right”, “lower center”, “upper center”, “upper left quarter”, “upper right quarter”, “lower left quarter”, “lower right quarter”. Default: 5.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
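
The role of scales and max_wh_scale_gap can be sketched by enumerating the candidate crop sizes. The helper name `candidate_crop_sizes` is hypothetical, and the sketch leaves out any snapping of sizes to input_size as well as the choice of crop position.

```python
def candidate_crop_sizes(img_w, img_h, scales, max_wh_scale_gap=1):
    """List (crop_w, crop_h) pairs whose scale levels are close enough."""
    base_size = min(img_w, img_h)            # scale 1 means the base size
    sizes = [int(base_size * s) for s in scales]
    return [
        (w, h)
        for i, h in enumerate(sizes)
        for j, w in enumerate(sizes)
        if abs(i - j) <= max_wh_scale_gap    # limit the aspect ratio
    ]


pairs = candidate_crop_sizes(340, 256, scales=(1, 0.875, 0.75, 0.66))
```

With four scales and a gap of 1, ten of the sixteen possible width and height combinations remain.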

class mmaction.datasets.pipelines.Normalize(mean, std, to_bgr=False, adjust_magnitude=False)[source]¶

Normalize images with the given mean and std value.

Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs” and “img_norm_cfg”. If modality is ‘Flow’, the additional key “scale_factor” is required.

Parameters

mean (Sequence[float]) – Mean values of different channels.
std (Sequence[float]) – Std values of different channels.
to_bgr (bool) – Whether to convert channels from RGB to BGR. Default: False.
adjust_magnitude (bool) – Indicate whether to adjust the flow magnitude on ‘scale_factor’ when modality is ‘Flow’. Default: False.
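As a rough illustration, a pipeline entry for this transform might be written as below. The mean and std values are the widely used ImageNet RGB statistics and are an assumption here, not something this reference prescribes; `normalize_pixel` is a hypothetical helper that only shows the per-channel arithmetic.

```python
# Assumed values: common ImageNet RGB statistics; replace with your own.
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    to_bgr=False,
)

# Pipeline step expressed as a config dict, using the documented parameters.
normalize_step = dict(type="Normalize", **img_norm_cfg)


def normalize_pixel(pixel, mean, std):
    """Per-channel (value - mean) / std, shown for a single pixel."""
    return [(v - m) / s for v, m, s in zip(pixel, mean, std)]


centered = normalize_pixel(img_norm_cfg["mean"], img_norm_cfg["mean"], img_norm_cfg["std"])
```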

class mmaction.datasets.pipelines.OpenCVDecode[source]¶

Using OpenCV to decode the video.

Required keys are “video_reader”, “filename” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

class mmaction.datasets.pipelines.OpenCVInit(io_backend='disk', **kwargs)[source]¶

Using OpenCV to initialize the video_reader.

Required keys are “filename”, added or modified keys are “new_path”, “video_reader” and “total_frames”.

class mmaction.datasets.pipelines.PyAVDecode(multi_thread=False)[source]¶

Using pyav to decode the video.

PyAV: https://github.com/mikeboers/PyAV

Required keys are “video_reader” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

Parameters: multi_thread (bool) – If set to True, it will apply multi thread processing. Default: False.

class mmaction.datasets.pipelines.PyAVDecodeMotionVector(multi_thread=False)[source]¶

Using pyav to decode the motion vectors from video.

Reference: https://github.com/PyAV-Org/PyAV/blob/main/tests/test_decode.py

Required keys are “video_reader” and “frame_inds”, added or modified keys are “motion_vectors”, “frame_inds”.

Parameters: multi_thread (bool) – If set to True, it will apply multi thread processing. Default: False.

class mmaction.datasets.pipelines.PyAVInit(io_backend='disk', **kwargs)[source]¶

Using pyav to initialize the video.

PyAV: https://github.com/mikeboers/PyAV

Required keys are “filename”, added or modified keys are “video_reader”, and “total_frames”.

Parameters

io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
kwargs (dict) – Args for file client.

class mmaction.datasets.pipelines.RandomCrop(size, lazy=False)[source]¶

Vanilla square random crop that specifies the output size.

Required keys in results are “imgs” and “img_shape”, added or modified keys are “imgs”, “lazy”; Required keys in “lazy” are “flip”, “crop_bbox”, added or modified key is “crop_bbox”.

Parameters

size (int) – The output size of the images.
lazy (bool) – Determine whether to apply lazy operation. Default: False.

class mmaction.datasets.pipelines.RandomRescale(scale_range, interpolation='bilinear')[source]¶

Randomly resize images so that the short edge is resized to a specific size in a given range. The aspect ratio is unchanged after resizing.

Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “resize_size”, “short_edge”.

Parameters

scale_range (tuple[int]) – The range of short edge length. A closed interval.
interpolation (str) – Algorithm used for interpolation: “nearest” | “bilinear”. Default: “bilinear”.

class mmaction.datasets.pipelines.RandomResizedCrop(area_range=(0.08, 1.0), aspect_ratio_range=(0.75, 1.3333333333333333), lazy=False)[source]¶

Random crop that specifies the area and height-width ratio range.

Required keys in results are “imgs”, “img_shape”, “crop_bbox” and “lazy”, added or modified keys are “imgs”, “crop_bbox” and “lazy”; Required keys in “lazy” are “flip”, “crop_bbox”, added or modified key is “crop_bbox”.

Parameters

area_range (Tuple[float]) – The candidate area scales range of output cropped images. Default: (0.08, 1.0).
aspect_ratio_range (Tuple[float]) – The candidate aspect ratio range of output cropped images. Default: (3 / 4, 4 / 3).
lazy (bool) – Determine whether to apply lazy operation. Default: False.

static get_crop_bbox(img_shape, area_range, aspect_ratio_range, max_attempts=10)[source]¶

Get a crop bbox given the area range and aspect ratio range.

Parameters

img_shape (Tuple[int]) – Image shape
area_range (Tuple[float]) – The candidate area scales range of output cropped images. Default: (0.08, 1.0).
aspect_ratio_range (Tuple[float]) – The candidate aspect ratio range of output cropped images. Default: (3 / 4, 4 / 3).
max_attempts (int) – Maximum number of attempts to generate a random candidate bounding box. If no qualified one is found, the center bounding box will be used. Default: 10.

Returns

(list[int]) A random crop bbox within the area range and aspect ratio range.
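The sampling loop with a center fallback can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: `sample_crop_bbox` is a hypothetical name, `img_shape` is assumed to be `(height, width)`, the aspect ratio is assumed to be drawn log-uniformly, and the fallback is assumed to be a centered square crop.

```python
import math
import random


def sample_crop_bbox(img_shape, area_range=(0.08, 1.0),
                     aspect_ratio_range=(3 / 4, 4 / 3), max_attempts=10):
    """Return [x1, y1, x2, y2] for a random crop (illustrative sketch)."""
    img_h, img_w = img_shape  # assumption: shape is (height, width)
    log_min = math.log(aspect_ratio_range[0])
    log_max = math.log(aspect_ratio_range[1])
    for _ in range(max_attempts):
        target_area = random.uniform(*area_range) * img_h * img_w
        # Assumption: sample the aspect ratio (w / h) log-uniformly.
        ratio = math.exp(random.uniform(log_min, log_max))
        crop_w = int(round(math.sqrt(target_area * ratio)))
        crop_h = int(round(math.sqrt(target_area / ratio)))
        if 0 < crop_w <= img_w and 0 < crop_h <= img_h:
            x = random.randint(0, img_w - crop_w)
            y = random.randint(0, img_h - crop_h)
            return [x, y, x + crop_w, y + crop_h]
    # No candidate fit within max_attempts: fall back to a centered square.
    size = min(img_h, img_w)
    x = (img_w - size) // 2
    y = (img_h - size) // 2
    return [x, y, x + size, y + size]
```

Passing `max_attempts=0` skips the random search entirely, which is a convenient way to inspect the fallback box.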

class mmaction.datasets.pipelines.RandomScale(scales, mode='range', **kwargs)[source]¶

Resize images by a random scale.

Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “lazy”, “scale”, “resize_size”. No keys in “lazy” are required; the added or modified key is “interpolation”.

Parameters

scales (tuple[int]) – Tuple of scales to be chosen for resize.
mode (str) – Selection mode for choosing the scale. Options are “range” and “value”. If set to “range”, the short edge will be randomly chosen from the range of minimum and maximum on the shorter one in all tuples. Otherwise, the longer edge will be randomly chosen from the range of minimum and maximum on the longer one in all tuples. Default: ‘range’.

class mmaction.datasets.pipelines.RawFrameDecode(io_backend='disk', decoding_backend='cv2', **kwargs)[source]¶

Load and decode frames with given indices.

Required keys are “frame_dir”, “filename_tmpl” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.

Parameters

io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
decoding_backend (str) – Backend used for image decoding. Default: ‘cv2’.
kwargs (dict, optional) – Arguments for FileClient.

class mmaction.datasets.pipelines.Rename(mapping)[source]¶

Rename the key in results.

Parameters: mapping (dict) – The keys in results that need to be renamed. The key of the dict is the original name, while the value is the new name. If the original name is not found in results, nothing is done. Default: dict().
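The renaming behaviour described above amounts to the following sketch. `rename_keys` is a hypothetical stand-alone helper written for illustration; it is not the transform itself.

```python
def rename_keys(results, mapping):
    """Rename keys of `results` in place; missing original names are skipped."""
    for old_key, new_key in mapping.items():
        if old_key in results:
            results[new_key] = results.pop(old_key)
    return results


results = rename_keys(
    {"imgs": [1, 2], "label": 3},
    {"imgs": "frames", "missing": "other"},  # "missing" is not in results
)
```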

class mmaction.datasets.pipelines.Resize(scale, keep_ratio=True, interpolation='bilinear', lazy=False)[source]¶

Resize images to a specific size.

Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “lazy”, “resize_size”. No keys in “lazy” are required; the added or modified key is “interpolation”.

Parameters

scale (float | Tuple[int]) – If keep_ratio is True, it serves as scaling factor or maximum size: If it is a float number, the image will be rescaled by this factor, else if it is a tuple of 2 integers, the image will be rescaled as large as possible within the scale. Otherwise, it serves as (w, h) of output size.
keep_ratio (bool) – If set to True, Images will be resized without changing the aspect ratio. Otherwise, it will resize images to a given size. Default: True.
interpolation (str) – Algorithm used for interpolation: “nearest” | “bilinear”. Default: “bilinear”.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
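The two meanings of scale when keep_ratio is True can be illustrated with a short sketch. `rescaled_size` is a hypothetical helper; the rounding rule and the treatment of the tuple as (long edge limit, short edge limit) regardless of order are assumptions, not confirmed details of the library.

```python
def rescaled_size(old_size, scale):
    """Compute the (w, h) after an aspect-ratio-preserving resize."""
    w, h = old_size
    if isinstance(scale, (int, float)):
        # A single number is used directly as the scaling factor.
        factor = float(scale)
    else:
        # A tuple bounds the result: rescale as large as possible within it.
        max_long_edge, max_short_edge = max(scale), min(scale)
        factor = min(max_long_edge / max(w, h), max_short_edge / min(w, h))
    # Assumption: round to the nearest integer.
    return int(w * factor + 0.5), int(h * factor + 0.5)


half = rescaled_size((320, 240), 0.5)
bounded = rescaled_size((320, 240), (340, 256))
```

In the second call the long-edge limit is the binding constraint, so the output fills the width of the bound and stays just under the height limit.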

class mmaction.datasets.pipelines.SampleAVAFrames(clip_len, frame_interval=2, test_mode=False)[source]¶

class mmaction.datasets.pipelines.SampleFrames(clip_len, frame_interval=1, num_clips=1, temporal_jitter=False, twice_sample=False, out_of_bound_opt='loop', test_mode=False, start_index=None)[source]¶

Sample frames from the video.

Required keys are “filename”, “total_frames”, “start_index”, added or modified keys are “frame_inds”, “frame_interval” and “num_clips”.

Parameters

clip_len (int) – Frames of each sampled output clip.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.
num_clips (int) – Number of clips to be sampled. Default: 1.
temporal_jitter (bool) – Whether to apply temporal jittering. Default: False.
twice_sample (bool) – Whether to use twice sample when testing. If set to True, it will sample frames with and without fixed shift, which is commonly used for testing in TSM model. Default: False.
out_of_bound_opt (str) – The way to deal with out of bounds frame indexes. Available options are ‘loop’, ‘repeat_last’. Default: ‘loop’.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
start_index (None) – This argument is deprecated and moved to dataset class (BaseDataset, VideoDataset, RawframeDataset, etc), see this: https://github.com/open-mmlab/mmaction2/pull/89.
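To make the interplay of clip_len, frame_interval, num_clips and out_of_bound_opt concrete, here is a deterministic sketch that places clips at evenly spaced offsets. It is an illustration only: `sample_frame_inds` is a hypothetical helper, the even spacing is an assumed test-style strategy, and ‘repeat_last’ is assumed to repeat the last in-bounds index of the clip.

```python
def sample_frame_inds(total_frames, clip_len, frame_interval=1,
                      num_clips=1, out_of_bound_opt="loop"):
    """Return one list of frame indices per clip (illustrative sketch)."""
    span = clip_len * frame_interval  # frames covered by a single clip
    if total_frames > span - 1:
        avg_interval = (total_frames - span + 1) / num_clips
        offsets = [int(i * avg_interval + avg_interval / 2)
                   for i in range(num_clips)]
    else:
        offsets = [0] * num_clips  # video shorter than one clip
    clips = []
    for offset in offsets:
        clip, last_valid = [], offset
        for k in range(clip_len):
            idx = offset + k * frame_interval
            if idx < total_frames:
                last_valid = idx
            elif out_of_bound_opt == "loop":
                idx %= total_frames  # wrap around to the start
            elif out_of_bound_opt == "repeat_last":
                idx = last_valid     # assumption: reuse last in-bounds index
            else:
                raise ValueError(f"unknown option: {out_of_bound_opt}")
            clip.append(idx)
        clips.append(clip)
    return clips


regular = sample_frame_inds(10, clip_len=3, frame_interval=2, num_clips=2)
looped = sample_frame_inds(4, clip_len=3, frame_interval=2)
repeated = sample_frame_inds(4, clip_len=3, frame_interval=2,
                             out_of_bound_opt="repeat_last")
```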

class mmaction.datasets.pipelines.SampleProposalFrames(clip_len, body_segments, aug_segments, aug_ratio, frame_interval=1, test_interval=6, temporal_jitter=False, mode='train')[source]¶

Sample frames from proposals in the video.

Required keys are “total_frames” and “out_proposals”, added or modified keys are “frame_inds”, “frame_interval”, “num_clips”, ‘clip_len’ and ‘num_proposals’.

Parameters

clip_len (int) – Frames of each sampled output clip.
body_segments (int) – Number of segments in course period.
aug_segments (list[int]) – Number of segments in starting and ending period.
aug_ratio (int | float | tuple[int | float]) – The ratio of the length of augmentation to that of the proposal.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.
test_interval (int) – Temporal interval of adjacent sampled frames in test mode. Default: 6.
temporal_jitter (bool) – Whether to apply temporal jittering. Default: False.
mode (str) – Choose ‘train’, ‘val’ or ‘test’ mode. Default: ‘train’.

class mmaction.datasets.pipelines.TenCrop(crop_size)[source]¶

Crop the images into 10 crops (corner + center + flip).

Crop the four corners and the center part of the image with the same given crop_size, and flip it horizontally. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.

Parameters: crop_size (int | tuple[int]) – (w, h) of crop size.
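A sketch of how the ten crops could be enumerated is given below. `ten_crop_boxes` is a hypothetical helper, and both the box format (x1, y1, x2, y2) and the ordering of the crops are assumptions made for illustration.

```python
def ten_crop_boxes(img_w, img_h, crop_w, crop_h):
    """Return ten (box, flip) pairs: four corners and center, each unflipped and flipped."""
    x_max, y_max = img_w - crop_w, img_h - crop_h
    offsets = [
        (0, 0),                      # upper left
        (x_max, 0),                  # upper right
        (0, y_max),                  # lower left
        (x_max, y_max),              # lower right
        (x_max // 2, y_max // 2),    # center
    ]
    crops = []
    for x, y in offsets:
        box = (x, y, x + crop_w, y + crop_h)
        crops.append((box, False))  # original crop
        crops.append((box, True))   # horizontally flipped copy
    return crops


crops = ten_crop_boxes(340, 256, 224, 224)
```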

class mmaction.datasets.pipelines.ThreeCrop(crop_size)[source]¶

Crop images into three crops.

Crop the images equally into three crops with equal intervals along the shorter side. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.

Parameters: crop_size (int | tuple[int]) – (w, h) of crop size.

class mmaction.datasets.pipelines.ToDataContainer(fields)[source]¶

Convert the data to DataContainer.

Parameters: fields (Sequence[dict]) – Required fields to be converted with keys and attributes. E.g. fields=(dict(key=’gt_bbox’, stack=False),). Note that key can also be a list of keys, if so, every tensor in the list will be converted to DataContainer.

class mmaction.datasets.pipelines.ToTensor(keys)[source]¶

Convert some values in results dict to torch.Tensor type in data loader pipeline.

Parameters: keys (Sequence[str]) – Required keys to be converted.

class mmaction.datasets.pipelines.Transpose(keys, order)[source]¶

Transpose image channels to a given order.

Parameters

keys (Sequence[str]) – Required keys to be converted.
order (Sequence[int]) – Image channel order.

class mmaction.datasets.pipelines.UntrimmedSampleFrames(clip_len=1, frame_interval=16, start_index=None)[source]¶

Sample frames from the untrimmed video.

Required keys are “filename”, “total_frames”, added or modified keys are “frame_inds”, “frame_interval” and “num_clips”.

Parameters

clip_len (int) – The length of sampled clips. Default: 1.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 16.
start_index (None) – This argument is deprecated and moved to dataset class (BaseDataset, VideoDataset, RawframeDataset, etc), see this: https://github.com/open-mmlab/mmaction2/pull/89.

samplers¶

class mmaction.datasets.samplers.DistributedPowerSampler(dataset, num_replicas=None, rank=None, power=1, seed=0)[source]¶

DistributedPowerSampler inheriting from torch.utils.data.DistributedSampler.

Samples are sampled with the probability that is proportional to the power of label frequency (freq ^ power). The sampler only applies to single class recognition dataset.

The default value of power is 1, which is equivalent to bootstrap sampling from the entire dataset.
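The effect of power on the class distribution can be sketched as follows. `class_sampling_probs` is a hypothetical helper that only computes the per-class probabilities; how the sampler then picks an item inside the chosen class (assumed uniform here) is not specified by this reference.

```python
from collections import Counter


def class_sampling_probs(labels, power=1.0):
    """Probability of drawing each class, proportional to freq ** power."""
    freq = Counter(labels)
    weights = {cls: count ** power for cls, count in freq.items()}
    total = sum(weights.values())
    return {cls: weight / total for cls, weight in weights.items()}


labels = ["a"] * 8 + ["b"] * 2
by_frequency = class_sampling_probs(labels, power=1)  # matches dataset frequencies
balanced = class_sampling_probs(labels, power=0)      # every class equally likely
```

Values of power between 0 and 1 interpolate between class-balanced sampling and sampling that follows the dataset's label frequencies.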

class mmaction.datasets.samplers.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0)[source]¶

DistributedSampler inheriting from torch.utils.data.DistributedSampler.

In older versions of PyTorch, there is no shuffle argument. This child class ports one to DistributedSampler.

mmaction.utils¶

class mmaction.utils.GradCAM(model, target_layer_name, colormap='viridis')[source]¶

GradCAM class helps create visualization results.

Visualization results are blended by heatmaps and input images. This class is modified from https://github.com/facebookresearch/SlowFast/blob/master/slowfast/visualization/gradcam_utils.py

For more information about GradCAM, please visit: https://arxiv.org/pdf/1610.02391.pdf

class mmaction.utils.PreciseBNHook(dataloader, num_iters=200, interval=1)[source]¶

Precise BN hook.

dataloader¶

A PyTorch dataloader.

Type: DataLoader

num_iters¶

Number of iterations to update the bn stats. Default: 200.

Type: int

interval¶

Interval (in epochs) at which precise BN is performed. Default: 1.

Type: int

mmaction.utils.get_random_string(length=15)[source]¶

Get random string with letters and digits.

Parameters: length (int) – Length of random string. Default: 15.
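An equivalent helper can be written with the standard library alone. The sketch below uses the hypothetical name `random_string` and is not the library's code; it only shows one way to produce a string of letters and digits of the requested length.

```python
import random
import string


def random_string(length=15):
    """Build a string of `length` random ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in range(length))


token = random_string()
```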

mmaction.utils.get_root_logger(log_file=None, log_level=20)[source]¶

Use get_logger method in mmcv to get the root logger.

The logger will be initialized if it has not been initialized. By default a StreamHandler will be added. If log_file is specified, a FileHandler will also be added. The name of the root logger is the top-level package name, e.g., “mmaction”.

Parameters

log_file (str | None) – The log filename. If specified, a FileHandler will be added to the root logger.
log_level (int) – The root logger level. Note that only the process of rank 0 is affected, while other processes will set the level to “Error” and be silent most of the time.

Returns

The root logger.

Return type

logging.Logger

mmaction.utils.get_shm_dir()[source]¶: Get shm dir for temporary usage.

mmaction.utils.get_thread_id()[source]¶: Get current thread id.

mmaction.utils.import_module_error_class(module_name)[source]¶: When a class is imported incorrectly due to a missing module, raise an import error when the class is instantiated.

mmaction.utils.import_module_error_func(module_name)[source]¶: When a function is imported incorrectly due to a missing module, raise an import error when the function is called.

mmaction.localization¶

mmaction.localization.eval_ap(detections, gt_by_cls, iou_range)[source]¶

Evaluate average precisions.

Parameters

detections (dict) – Results of detections.
gt_by_cls (dict) – Information of groundtruth.
iou_range (list) – Ranges of iou.

Returns

Average precision values of classes at ious.

Return type

list

mmaction.localization.generate_bsp_feature(video_list, video_infos, tem_results_dir, pgm_proposals_dir, top_k=1000, bsp_boundary_ratio=0.2, num_sample_start=8, num_sample_end=8, num_sample_action=16, num_sample_interp=3, tem_results_ext='.csv', pgm_proposal_ext='.csv', result_dict=None)[source]¶

Generate Boundary-Sensitive Proposal Feature with given proposals.

Parameters

video_list (list[int]) – List of video indexes to generate bsp_feature.
video_infos (list[dict]) – List of video_info dict that contains ‘video_name’.
tem_results_dir (str) – Directory to load temporal evaluation results.
pgm_proposals_dir (str) – Directory to load proposals.
top_k (int) – Number of proposals to be considered. Default: 1000
bsp_boundary_ratio (float) – Ratio for proposal boundary (start/end). Default: 0.2.
num_sample_start (int) – Num of samples for actionness in start region. Default: 8.
num_sample_end (int) – Num of samples for actionness in end region. Default: 8.
num_sample_action (int) – Num of samples for actionness in center region. Default: 16.
num_sample_interp (int) – Num of samples for interpolation for each sample point. Default: 3.
tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.
pgm_proposal_ext (str) – File extension for proposals. Default: ‘.csv’.
result_dict (dict | None) – The dict to save the results. Default: None.

Returns

A dict that contains video_name as keys and bsp_feature as values. If result_dict is not None, the results are saved to it.

Return type

bsp_feature_dict (dict)

mmaction.localization.generate_candidate_proposals(video_list, video_infos, tem_results_dir, temporal_scale, peak_threshold, tem_results_ext='.csv', result_dict=None)[source]¶

Generate Candidate Proposals with given temporal evaluation results. Each proposal file will contain: ‘tmin,tmax,tmin_score,tmax_score,score,match_iou,match_ioa’.

Parameters

video_list (list[int]) – List of video indexes to generate proposals.
video_infos (list[dict]) – List of video_info dict that contains ‘video_name’, ‘duration_frame’, ‘duration_second’, ‘feature_frame’, and ‘annotations’.
tem_results_dir (str) – Directory to load temporal evaluation results.
temporal_scale (int) – The number (scale) on temporal axis.
peak_threshold (float) – The threshold for proposal generation.
tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.
result_dict (dict | None) – The dict to save the results. Default: None.

Returns

A dict that contains video_name as keys and proposal lists as values. If result_dict is not None, the results are saved to it.

Return type

dict

mmaction.localization.load_localize_proposal_file(filename)[source]¶

Load the proposal file and split it into many parts which contain one video’s information separately.

Parameters: filename (str) – Path to the proposal file.
Returns: List of all videos’ information.
Return type: list

mmaction.localization.perform_regression(detections)[source]¶

Perform regression on detection results.

Parameters: detections (list) – Detection results before regression.
Returns: Detection results after regression.
Return type: list

mmaction.localization.soft_nms(proposals, alpha, low_threshold, high_threshold, top_k)[source]¶

Soft NMS for temporal proposals.

Parameters

proposals (np.ndarray) – Proposals generated by network.
alpha (float) – Alpha value of Gaussian decaying function.
low_threshold (float) – Low threshold for soft nms.
high_threshold (float) – High threshold for soft nms.
top_k (int) – Top k values to be considered.

Returns

The updated proposals.

Return type

np.ndarray
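For orientation, a Gaussian-decay soft NMS over temporal proposals might look like the sketch below. It is a simplified illustration, not the library's implementation: proposals are assumed to be (start, end, score) triples with times normalized to [0, 1], and the decay threshold is assumed to interpolate between low_threshold and high_threshold according to the width of the selected proposal.

```python
import math


def soft_nms_sketch(proposals, alpha, low_threshold, high_threshold, top_k):
    """Return up to top_k (start, end, score) proposals after score decay."""
    remaining = [list(p) for p in proposals]
    kept = []
    while remaining and len(kept) < top_k:
        # Pick the currently highest-scoring proposal.
        best = max(range(len(remaining)), key=lambda i: remaining[i][2])
        b_start, b_end, b_score = remaining.pop(best)
        kept.append((b_start, b_end, b_score))
        width = b_end - b_start
        # Assumption: wider proposals use a threshold closer to high_threshold.
        threshold = low_threshold + (high_threshold - low_threshold) * width
        for p in remaining:
            inter = max(min(p[1], b_end) - max(p[0], b_start), 0.0)
            union = (p[1] - p[0]) + width - inter
            iou = inter / union if union > 0 else 0.0
            if iou > threshold:
                # Gaussian decay instead of outright removal.
                p[2] *= math.exp(-(iou ** 2) / alpha)
    return kept


kept = soft_nms_sketch(
    [(0.0, 0.5, 0.9), (0.05, 0.5, 0.8), (0.6, 0.9, 0.7)],
    alpha=0.4, low_threshold=0.5, high_threshold=0.9, top_k=3,
)
```

Unlike hard NMS, the heavily overlapping second proposal is not dropped; its score is reduced so that it ranks below the non-overlapping one.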

mmaction.localization.temporal_iop(proposal_min, proposal_max, gt_min, gt_max)[source]¶

Compute IoP score between a groundtruth bbox and the proposals.

Compute the IoP which is defined as the overlap ratio with groundtruth proportional to the duration of this proposal.

Parameters

proposal_min (list[float]) – List of temporal anchor min.
proposal_max (list[float]) – List of temporal anchor max.
gt_min (float) – Groundtruth temporal box min.
gt_max (float) – Groundtruth temporal box max.

Returns

List of intersection over anchor scores.

Return type

list[float]
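The IoP definition above (overlap with the groundtruth divided by the proposal's own duration) can be written out as a plain-Python sketch. `temporal_iop_sketch` is a hypothetical name, and returning 0.0 for zero-length proposals is an assumption made to avoid division by zero.

```python
def temporal_iop_sketch(proposal_min, proposal_max, gt_min, gt_max):
    """Intersection over proposal length for each proposal."""
    scores = []
    for p_min, p_max in zip(proposal_min, proposal_max):
        inter = max(min(p_max, gt_max) - max(p_min, gt_min), 0.0)
        length = p_max - p_min
        # Assumption: a zero-length proposal scores 0.
        scores.append(inter / length if length > 0 else 0.0)
    return scores


iop_scores = temporal_iop_sketch([0.0, 5.0], [10.0, 10.0], gt_min=5.0, gt_max=15.0)
```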

mmaction.localization.temporal_iou(proposal_min, proposal_max, gt_min, gt_max)[source]¶

Compute IoU score between a groundtruth bbox and the proposals.

Parameters

proposal_min (list[float]) – List of temporal anchor min.
proposal_max (list[float]) – List of temporal anchor max.
gt_min (float) – Groundtruth temporal box min.
gt_max (float) – Groundtruth temporal box max.

Returns

List of iou scores.

Return type

list[float]
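Temporal IoU differs from IoP only in the denominator, which is the union of the two intervals. The sketch below uses the hypothetical name `temporal_iou_sketch` and assumes a score of 0.0 when the union is empty.

```python
def temporal_iou_sketch(proposal_min, proposal_max, gt_min, gt_max):
    """Intersection over union between each proposal and one groundtruth box."""
    scores = []
    gt_len = gt_max - gt_min
    for p_min, p_max in zip(proposal_min, proposal_max):
        inter = max(min(p_max, gt_max) - max(p_min, gt_min), 0.0)
        union = (p_max - p_min) + gt_len - inter
        scores.append(inter / union if union > 0 else 0.0)
    return scores


iou_scores = temporal_iou_sketch(
    [0.0, 5.0, 20.0], [10.0, 15.0, 30.0], gt_min=5.0, gt_max=15.0
)
```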

mmaction.localization.temporal_nms(detections, threshold)[source]¶

Perform temporal NMS on detection results.

Parameters

detections (list) – Detection results before NMS.
threshold (float) – Threshold of NMS.

Returns

Detection results after NMS.

Return type

list
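A greedy temporal non-maximum suppression can be sketched as follows. This is an illustration under assumptions rather than the library's code: each detection is assumed to be a (start, end, score) triple, and a detection is dropped when its temporal IoU with an already kept, higher-scoring detection exceeds the threshold.

```python
def temporal_nms_sketch(detections, threshold):
    """Keep the highest-scoring detections that do not overlap too much."""

    def iou(a, b):
        inter = max(min(a[1], b[1]) - max(a[0], b[0]), 0.0)
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    kept = []
    # Visit detections from the highest score to the lowest.
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(iou(det, other) <= threshold for other in kept):
            kept.append(det)
    return kept


survivors = temporal_nms_sketch(
    [(0, 10, 0.9), (1, 10, 0.8), (20, 30, 0.7)], threshold=0.5
)
```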