API Reference¶
mmaction.apis¶
-
mmaction.apis.inference_recognizer(model, video_path, label_path, use_frames=False, outputs=None, as_tensor=True)[source]¶ Run inference on a video with the recognizer.
- Parameters
model (nn.Module) – The loaded recognizer.
video_path (str) – The video file path/url or the rawframes directory path. If use_frames is set to True, it should be the rawframes directory path. Otherwise, it should be the video file path.
label_path (str) – The label file path.
use_frames (bool) – Whether to use rawframes as input. Default: False.
outputs (list(str) | tuple(str) | str | None) – Names of layers whose outputs need to be returned, default: None.
as_tensor (bool) – Same as that in OutputHook. Default: True.
- Returns
dict[tuple(str, float)]: Top-5 recognition result dict.
dict[torch.tensor | np.ndarray]: Output feature maps from layers specified in outputs.
- Return type
dict[tuple(str, float)]
-
mmaction.apis.init_recognizer(config, checkpoint=None, device='cuda:0', use_frames=False)[source]¶ Initialize a recognizer from config file.
- Parameters
config (str | mmcv.Config) – Config file path or the config object.
checkpoint (str | None, optional) – Checkpoint path/url. If set to None, the model will not load any weights. Default: None.
device (str | torch.device) – The desired device of returned tensor. Default: ‘cuda:0’.
use_frames (bool) – Whether to use rawframes as input. Default: False.
- Returns
The constructed recognizer.
- Return type
nn.Module
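A minimal usage sketch that chains the two functions above, using only the signatures documented here. The helper name recognize and all paths passed to it are hypothetical; the import is done lazily inside the function so the snippet can be loaded on a machine without MMAction installed.

```python
def recognize(config_path, checkpoint_path, video_path, label_path,
              device="cuda:0"):
    """Build a recognizer and return its top-5 result for one video.

    All four paths are placeholders supplied by the caller; nothing here
    assumes a particular config or checkpoint.
    """
    # Lazy import: only needed when the function is actually called.
    from mmaction.apis import inference_recognizer, init_recognizer

    # Build the model from a config file and load the checkpoint weights.
    model = init_recognizer(config_path, checkpoint_path, device=device)

    # With use_frames left at its default (False), video_path must point
    # to a video file rather than a rawframes directory.
    return inference_recognizer(model, video_path, label_path)
```

According to the return type documented above, each entry of the result pairs a label string with a float score.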
-
mmaction.apis.multi_gpu_test(model, data_loader, tmpdir=None, gpu_collect=True)[source]¶ Test model with multiple gpus.
This method tests the model with multiple gpus and collects the results under two different modes: gpu and cpu. With ‘gpu_collect=True’, it encodes results as gpu tensors and uses gpu communication for results collection. In cpu mode, it saves the results from the different gpus to ‘tmpdir’ and the rank 0 worker collects them.
- Parameters
model (nn.Module) – Model to be tested.
data_loader (DataLoader) – PyTorch data loader.
tmpdir (str) – Path of directory to save the temporary results from different gpus under cpu mode. Default: None
gpu_collect (bool) – Option to use either gpu or cpu to collect results. Default: True
- Returns
The prediction results.
- Return type
list
-
mmaction.apis.single_gpu_test(model, data_loader)[source]¶ Test model with a single gpu.
This method tests model with a single gpu and displays test progress bar.
- Parameters
model (nn.Module) – Model to be tested.
data_loader (DataLoader) – PyTorch data loader.
- Returns
The prediction results.
- Return type
list
-
mmaction.apis.train_model(model, dataset, cfg, distributed=False, validate=False, test={'test_best': False, 'test_last': False}, timestamp=None, meta=None)[source]¶ Train model entry function.
- Parameters
model (nn.Module) – The model to be trained.
dataset (Dataset) – Train dataset.
cfg (dict) – The config dict for training.
distributed (bool) – Whether to use distributed training. Default: False.
validate (bool) – Whether to do evaluation. Default: False.
test (dict) – The testing option, with two keys: test_last & test_best. The value is True or False, indicating whether to test the corresponding checkpoint. Default: dict(test_best=False, test_last=False).
timestamp (str | None) – Local time for runner. Default: None.
meta (dict | None) – Meta dict to record some important information. Default: None
mmaction.core¶
optimizer¶
-
class
mmaction.core.optimizer.CopyOfSGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)[source]¶ A clone of torch.optim.SGD.
A customized optimizer could be defined like CopyOfSGD. You may derive from built-in optimizers in torch.optim, or directly implement a new optimizer.
-
class
mmaction.core.optimizer.TSMOptimizerConstructor(optimizer_cfg, paramwise_cfg=None)[source]¶ Optimizer constructor in TSM model.
This constructor builds optimizer in different ways from the default one.
Parameters of the first conv layer have default lr and weight decay.
Parameters of BN layers have default lr and zero weight decay.
If the field “fc_lr5” in paramwise_cfg is set to True, the parameters of the last fc layer in cls_head have 5x lr multiplier and 10x weight decay multiplier.
Weights of other layers have default lr and weight decay, and biases have a 2x lr multiplier and zero weight decay.
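The rules above can be read as a lookup from parameter kind to a learning-rate and weight-decay pair. The sketch below encodes only what the description states; the function name and the kind tags ('first_conv', 'bn', 'last_fc', 'weight', 'bias') are hypothetical labels for illustration, not identifiers from the library.

```python
def tsm_param_settings(kind, base_lr, base_wd, fc_lr5=False):
    """Return (lr, weight_decay) for one parameter under the TSM rules.

    `kind` is a hypothetical tag describing where the parameter lives.
    """
    if kind == "first_conv":
        # First conv layer: default lr and weight decay.
        return base_lr, base_wd
    if kind == "bn":
        # BN layers: default lr, zero weight decay.
        return base_lr, 0.0
    if kind == "last_fc" and fc_lr5:
        # Last fc layer in cls_head when fc_lr5 is True:
        # 5x lr multiplier and 10x weight decay multiplier.
        return base_lr * 5, base_wd * 10
    if kind == "bias":
        # Biases of other layers: 2x lr, zero weight decay.
        return base_lr * 2, 0.0
    # Weights of other layers (and, as a simplification in this sketch,
    # the last fc layer when fc_lr5 is False): defaults.
    return base_lr, base_wd


for kind in ("first_conv", "bn", "last_fc", "weight", "bias"):
    print(kind, tsm_param_settings(kind, 0.01, 1e-4, fc_lr5=True))
```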
evaluation¶
-
class
mmaction.core.evaluation.ActivityNetLocalization(ground_truth_filename=None, prediction_filename=None, tiou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]), verbose=False)[source]¶ Class to evaluate detection results on ActivityNet.
- Parameters
ground_truth_filename (str | None) – The filename of groundtruth. Default: None.
prediction_filename (str | None) – The filename of action detection results. Default: None.
tiou_thresholds (np.ndarray) – The thresholds of temporal iou to evaluate. Default: np.linspace(0.5, 0.95, 10).
verbose (bool) – Whether to print verbose logs. Default: False.
-
class
mmaction.core.evaluation.DistEvalHook(dataloader, start=None, interval=1, by_epoch=True, save_best='auto', rule=None, broadcast_bn_buffer=True, tmpdir=None, gpu_collect=False, **eval_kwargs)[source]¶ Distributed evaluation hook.
This hook will regularly perform evaluation in a given interval when performing in distributed environment.
- Parameters
dataloader (DataLoader) – A PyTorch dataloader.
start (int | None, optional) – Evaluation starting epoch. It enables evaluation before the training starts if start <= the resuming epoch. If None, whether to evaluate is decided solely by interval. Default: None.
interval (int) – Evaluation interval. Default: 1.
by_epoch (bool) – Whether to perform evaluation by epoch or by iteration. If set to True, it will perform by epoch. Otherwise, by iteration. Default: True.
save_best (str | None, optional) – If a metric is specified, it is used to track the best checkpoint during evaluation. The information about the best checkpoint is saved in best.json. Options are the evaluation metrics of the test dataset, e.g., top1_acc, top5_acc, mean_class_accuracy, mean_average_precision, mmit_mean_average_precision for action recognition datasets (RawframeDataset and VideoDataset); AR@AN, auc for action localization datasets (ActivityNetDataset); mAP@0.5IOU for spatio-temporal action detection datasets (AVADataset). If save_best is auto, the first key of the returned OrderedDict result will be used. Default: ‘auto’.
rule (str | None, optional) – Comparison rule for the best score. If set to None, a reasonable rule is inferred: keys such as ‘acc’, ‘top’, etc. use the ‘greater’ rule, and keys containing ‘loss’ use the ‘less’ rule. Options are ‘greater’, ‘less’, None. Default: None.
tmpdir (str | None) – Temporary directory to save the results of all processes. Default: None.
gpu_collect (bool) – Whether to use gpu or cpu to collect results. Default: False.
broadcast_bn_buffer (bool) – Whether to broadcast the buffer(running_mean and running_var) of rank 0 to other rank before evaluation. Default: True.
**eval_kwargs – Evaluation arguments fed into the evaluate function of the dataset.
-
class
mmaction.core.evaluation.EvalHook(dataloader, start=None, interval=1, by_epoch=True, save_best='auto', rule=None, **eval_kwargs)[source]¶ Non-Distributed evaluation hook.
Notes
If new arguments are added for EvalHook, tools/test.py and tools/eval_metric.py may be affected.
This hook will regularly perform evaluation in a given interval when performing in non-distributed environment.
- Parameters
dataloader (DataLoader) – A PyTorch dataloader.
start (int | None, optional) – Evaluation starting epoch. It enables evaluation before the training starts if start <= the resuming epoch. If None, whether to evaluate is decided solely by interval. Default: None.
interval (int) – Evaluation interval. Default: 1.
by_epoch (bool) – Whether to perform evaluation by epoch or by iteration. If set to True, it will perform by epoch. Otherwise, by iteration. Default: True.
save_best (str | None, optional) – If a metric is specified, it is used to track the best checkpoint during evaluation. The information about the best checkpoint is saved in best.json. Options are the evaluation metrics of the test dataset, e.g., top1_acc, top5_acc, mean_class_accuracy, mean_average_precision, mmit_mean_average_precision for action recognition datasets (RawframeDataset and VideoDataset); AR@AN, auc for action localization datasets (ActivityNetDataset); mAP@0.5IOU for spatio-temporal action detection datasets (AVADataset). If save_best is auto, the first key of the returned OrderedDict result will be used. Default: ‘auto’.
rule (str | None, optional) – Comparison rule for the best score. If set to None, a reasonable rule is inferred: keys such as ‘acc’, ‘top’, etc. use the ‘greater’ rule, and keys containing ‘loss’ use the ‘less’ rule. Options are ‘greater’, ‘less’, None. Default: None.
**eval_kwargs – Evaluation arguments fed into the evaluate function of the dataset.
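The rule inference described for save_best and rule can be sketched as a substring check on the metric name. This is an illustration only: the key tuples below contain just the examples named in the text (‘acc’, ‘top’, ‘loss’), whereas the hook's real list is longer, and both function names are hypothetical.

```python
def infer_rule(key_indicator, greater_keys=("acc", "top"), less_keys=("loss",)):
    """Guess the comparison rule from the metric name.

    The key tuples are illustrative; they only cover the examples
    mentioned in the parameter description.
    """
    if any(key in key_indicator for key in greater_keys):
        return "greater"
    if any(key in key_indicator for key in less_keys):
        return "less"
    raise ValueError(f"Cannot infer a rule for {key_indicator!r}; pass rule explicitly.")


def is_better(new_score, best_score, rule):
    """Compare a new evaluation score against the best one seen so far."""
    if rule == "greater":
        return new_score > best_score
    if rule == "less":
        return new_score < best_score
    raise ValueError(f"Unknown rule: {rule!r}")


print(infer_rule("top1_acc"))                            # greater
print(is_better(0.42, 0.57, infer_rule("val_loss")))     # True
```

In this sketch a metric such as mean_average_precision matches neither tuple, so a caller would need to pass the rule explicitly.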
-
mmaction.core.evaluation.average_precision_at_temporal_iou(ground_truth, prediction, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]¶ Compute average precision (in detection task) between ground truth and predicted data frames. If multiple predictions match the same predicted segment, only the one with highest score is matched as true positive. This code is greatly inspired by Pascal VOC devkit.
- Parameters
ground_truth (dict) – Dict containing the ground truth instances. Key: ‘video_id’ Value (np.ndarray): 1D array of ‘t-start’ and ‘t-end’.
prediction (np.ndarray) – 2D array containing the information of proposal instances, including ‘video_id’, ‘class_id’, ‘t-start’, ‘t-end’ and ‘score’.
temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).
- Returns
1D array of average precision score.
- Return type
np.ndarray
-
mmaction.core.evaluation.average_recall_at_avg_proposals(ground_truth, proposals, total_num_proposals, max_avg_proposals=None, temporal_iou_thresholds=array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]))[source]¶ Computes the average recall given an average number (percentile) of proposals per video.
- Parameters
ground_truth (dict) – Dict containing the ground truth instances.
proposals (dict) – Dict containing the proposal instances.
total_num_proposals (int) – Total number of proposals in the proposal dict.
max_avg_proposals (int | None) – Max number of proposals for one video. Default: None.
temporal_iou_thresholds (np.ndarray) – 1D array with temporal_iou thresholds. Default: np.linspace(0.5, 0.95, 10).
- Returns
(recall, average_recall, proposals_per_video, auc). In recall, recall[i,j] is the recall at the i-th temporal_iou threshold and the j-th average number (percentile) of proposals per video. The average_recall is the recall averaged over the list of temporal_iou thresholds (1D array); it is equivalent to recall.mean(axis=0). The proposals_per_video is the average number of proposals per video. The auc is the area under the AR@AN curve.
- Return type
tuple([np.ndarray, np.ndarray, np.ndarray, float])
-
mmaction.core.evaluation.confusion_matrix(y_pred, y_real, normalize=None)[source]¶ Compute confusion matrix.
- Parameters
y_pred (list[int] | np.ndarray[int]) – Prediction labels.
y_real (list[int] | np.ndarray[int]) – Ground truth labels.
normalize (str | None) – Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population. If None, confusion matrix will not be normalized. Options are “true”, “pred”, “all”, None. Default: None.
- Returns
Confusion matrix.
- Return type
np.ndarray
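To make the three normalize options concrete, here is a pure-Python sketch of a confusion matrix with true labels on the rows and predictions on the columns, as the parameter description states. It is not the library code; the function name is hypothetical, and leaving empty rows or columns as zeros is an assumption of this sketch.

```python
def confusion_matrix_sketch(y_pred, y_real, normalize=None):
    """Confusion matrix as nested lists: rows = true, columns = predicted."""
    labels = sorted(set(y_pred) | set(y_real))
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    mat = [[0.0] * n for _ in range(n)]
    for pred, real in zip(y_pred, y_real):
        mat[index[real]][index[pred]] += 1

    if normalize == "true":
        # Each row (true class) sums to 1.
        for row in mat:
            total = sum(row)
            if total:
                row[:] = [value / total for value in row]
    elif normalize == "pred":
        # Each column (predicted class) sums to 1.
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            if total:
                for i in range(n):
                    mat[i][j] /= total
    elif normalize == "all":
        # The whole matrix sums to 1.
        total = sum(sum(row) for row in mat)
        if total:
            mat = [[value / total for value in row] for row in mat]
    elif normalize is not None:
        raise ValueError("normalize must be 'true', 'pred', 'all' or None")
    return mat


y_pred = [0, 1, 1, 2]
y_real = [0, 1, 2, 2]
print(confusion_matrix_sketch(y_pred, y_real))
print(confusion_matrix_sketch(y_pred, y_real, normalize="true"))
```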
-
mmaction.core.evaluation.get_weighted_score(score_list, coeff_list)[source]¶ Get weighted score with given scores and coefficients.
Given n predictions by different classifier: [score_1, score_2, …, score_n] (score_list) and their coefficients: [coeff_1, coeff_2, …, coeff_n] (coeff_list), return weighted score: weighted_score = score_1 * coeff_1 + score_2 * coeff_2 + … + score_n * coeff_n
- Parameters
score_list (list[list[np.ndarray]]) – List of list of scores, with shape n(number of predictions) X num_samples X num_classes
coeff_list (list[float]) – List of coefficients, with shape n.
- Returns
List of weighted scores.
- Return type
list[np.ndarray]
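The weighted-sum formula above can be sketched in plain Python, with nested lists standing in for the np.ndarray scores. The function name and the two-stream example values are made up for illustration.

```python
def weighted_score_sketch(score_list, coeff_list):
    """Fuse n predictions of shape [num_samples][num_classes].

    Computes score_1 * coeff_1 + ... + score_n * coeff_n element-wise.
    """
    assert len(score_list) == len(coeff_list), "one coefficient per prediction"
    num_samples = len(score_list[0])
    fused = []
    for sample in range(num_samples):
        num_classes = len(score_list[0][sample])
        fused.append([
            sum(coeff * scores[sample][cls]
                for scores, coeff in zip(score_list, coeff_list))
            for cls in range(num_classes)
        ])
    return fused


# Two hypothetical classifiers, two samples, two classes.
scores_a = [[0.25, 0.75], [0.5, 0.5]]
scores_b = [[0.5, 0.5], [0.0, 1.0]]
print(weighted_score_sketch([scores_a, scores_b], [1.0, 2.0]))
# [[1.25, 1.75], [0.5, 2.5]]
```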
-
mmaction.core.evaluation.interpolated_precision_recall(precision, recall)[source]¶ Interpolated AP - VOCdevkit from VOC 2011.
- Parameters
precision (np.ndarray) – The precision of different thresholds.
recall (np.ndarray) – The recall of different thresholds.
- Returns
Average precision score.
- Return type
float
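As an illustration of VOC-style interpolated AP, the sketch below pads the curve, replaces each precision value by the maximum precision at any higher recall, and sums the area where recall changes. It is written from the general VOC all-point interpolation scheme and may differ in detail from the library implementation; the function name is hypothetical.

```python
def interpolated_ap_sketch(precision, recall):
    """All-point interpolated average precision for one class."""
    # Pad so the curve starts at recall 0 and ends at recall 1.
    mprecision = [0.0] + list(precision) + [0.0]
    mrecall = [0.0] + list(recall) + [1.0]

    # Make precision monotonically non-increasing from right to left.
    for i in range(len(mprecision) - 2, -1, -1):
        mprecision[i] = max(mprecision[i], mprecision[i + 1])

    # Sum rectangle areas wherever recall changes.
    ap = 0.0
    for i in range(1, len(mrecall)):
        if mrecall[i] != mrecall[i - 1]:
            ap += (mrecall[i] - mrecall[i - 1]) * mprecision[i]
    return ap


print(interpolated_ap_sketch([1.0, 0.5], [0.5, 1.0]))  # 0.75
print(interpolated_ap_sketch([0.5, 1.0], [0.5, 1.0]))  # 1.0
```

The second call shows the interpolation at work: the early precision of 0.5 is lifted to 1.0 because a higher precision is reached at a larger recall.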
-
mmaction.core.evaluation.mean_average_precision(scores, labels)[source]¶ Mean average precision for multi-label recognition.
- Parameters
scores (list[np.ndarray]) – Prediction scores of different classes for each sample.
labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.
- Returns
The mean average precision.
- Return type
np.float
-
mmaction.core.evaluation.mean_class_accuracy(scores, labels)[source]¶ Calculate mean class accuracy.
- Parameters
scores (list[np.ndarray]) – Prediction scores for each class.
labels (list[int]) – Ground truth labels.
- Returns
Mean class accuracy.
- Return type
np.ndarray
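Mean class accuracy averages per-class accuracy rather than per-sample accuracy, which matters for imbalanced datasets. A minimal sketch follows, with the assumption (not confirmed by the text above) that only classes appearing in the ground truth are averaged; the function name is hypothetical.

```python
def mean_class_accuracy_sketch(scores, labels):
    """Average, over ground-truth classes, of the per-class hit rate."""
    hits, counts = {}, {}
    for sample_scores, label in zip(scores, labels):
        # Predicted class = index of the highest score.
        pred = max(range(len(sample_scores)), key=sample_scores.__getitem__)
        counts[label] = counts.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + int(pred == label)
    return sum(hits[c] / counts[c] for c in counts) / len(counts)


scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
labels = [0, 1, 1]
# Class 0: 1 of 1 correct; class 1: 1 of 2 correct.
print(mean_class_accuracy_sketch(scores, labels))  # 0.75
```

Plain accuracy on the same data is 2/3, since two of the three samples are classified correctly.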
-
mmaction.core.evaluation.mmit_mean_average_precision(scores, labels)[source]¶ Mean average precision for multi-label recognition. Used for reporting MMIT style mAP on Multi-Moments in Times. The difference is that this method calculates average-precision for each sample and averages them among samples.
- Parameters
scores (list[np.ndarray]) – Prediction scores of different classes for each sample.
labels (list[np.ndarray]) – Ground truth many-hot vector for each sample.
- Returns
The MMIT style mean average precision.
- Return type
np.float
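The difference between the two multi-label metrics is the axis over which average precision is computed: per class across samples for mean_average_precision, per sample across classes for the MMIT style. The sketch below shows this with a simple rank-based AP (precision averaged at each positive rank), which is an assumption of this example; the library may derive AP from a precision-recall curve instead. All function names are hypothetical.

```python
def average_precision_sketch(scores, targets):
    """Rank-based AP; returns None when there is no positive target."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if targets[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else None


def _mean(values):
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


def class_wise_map(scores, labels):
    """AP per class (column), averaged over classes."""
    num_classes = len(scores[0])
    return _mean(
        average_precision_sketch([s[k] for s in scores], [l[k] for l in labels])
        for k in range(num_classes))


def sample_wise_map(scores, labels):
    """AP per sample (row), averaged over samples (MMIT style)."""
    return _mean(average_precision_sketch(s, l) for s, l in zip(scores, labels))


scores = [[0.9, 0.8], [0.7, 0.1], [0.2, 0.6]]
labels = [[0, 1], [1, 0], [0, 1]]
print(class_wise_map(scores, labels))   # 0.75
print(sample_wise_map(scores, labels))
```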
-
mmaction.core.evaluation.pairwise_temporal_iou(candidate_segments, target_segments, calculate_overlap_self=False)[source]¶ Compute intersection over union between segments.
- Parameters
candidate_segments (np.ndarray) – 1-dim/2-dim array in format [init, end]/[m x 2:=[init, end]].
target_segments (np.ndarray) – 2-dim array in format [n x 2:=[init, end]].
calculate_overlap_self (bool) – Whether to calculate overlap_self (union / candidate_length) or not. Default: False.
- Returns
t_iou (np.ndarray): 1-dim array [n] / 2-dim array [n x m] with IoU ratio.
t_overlap_self (np.ndarray, optional): 1-dim array [n] / 2-dim array [n x m] with overlap_self, returned when calculate_overlap_self is True.
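The [n x m] layout of the result can be illustrated with a pure-Python sketch that uses nested lists instead of np.ndarray: one row per target segment and one column per candidate segment. The function name is hypothetical and the overlap_self output is left out.

```python
def pairwise_temporal_iou_sketch(candidate_segments, target_segments):
    """Return an [n x m] nested list of temporal IoU values.

    candidate_segments: m pairs [init, end]
    target_segments:    n pairs [init, end]
    """
    t_iou = []
    for t_init, t_end in target_segments:
        row = []
        for c_init, c_end in candidate_segments:
            # Overlap length, clipped at zero for disjoint segments.
            intersection = max(0.0, min(c_end, t_end) - max(c_init, t_init))
            union = (c_end - c_init) + (t_end - t_init) - intersection
            row.append(intersection / union if union > 0 else 0.0)
        t_iou.append(row)
    return t_iou


candidates = [[0, 10], [5, 15]]          # m = 2
targets = [[0, 10], [10, 20], [0, 5]]    # n = 3
for row in pairwise_temporal_iou_sketch(candidates, targets):
    print(row)
```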
-
mmaction.core.evaluation.softmax(x, dim=1)[source]¶ Compute softmax values for each set of scores in x.
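For reference, a softmax over the class dimension (the dim=1 case) can be sketched in pure Python as follows. Subtracting the row maximum before exponentiating is a standard stability trick and an assumption of this sketch, not a statement about the library code; the function name is hypothetical.

```python
import math


def softmax_rows(x):
    """Softmax over each row of a [num_samples][num_classes] nested list."""
    result = []
    for row in x:
        shift = max(row)  # avoids overflow for large scores
        exps = [math.exp(value - shift) for value in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


probs = softmax_rows([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
print([round(sum(row), 6) for row in probs])  # [1.0, 1.0]
```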
-
mmaction.core.evaluation.top_k_accuracy(scores, labels, topk=(1))[source]¶ Calculate top k accuracy score.
- Parameters
scores (list[np.ndarray]) – Prediction scores for each class.
labels (list[int]) – Ground truth labels.
topk (tuple[int]) – K value for top_k_accuracy. Default: (1, ).
- Returns
Top k accuracy score for each k.
- Return type
list[float]
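A small sketch of what top-k accuracy measures: a sample counts as correct for a given k if its ground-truth label is among the k highest-scoring classes. This is a plain-Python illustration with a hypothetical function name and made-up scores.

```python
def top_k_accuracy_sketch(scores, labels, topk=(1,)):
    """Return one accuracy value per k in topk."""
    results = []
    for k in topk:
        hits = 0
        for sample_scores, label in zip(scores, labels):
            # Class indices sorted by descending score.
            ranked = sorted(range(len(sample_scores)),
                            key=lambda cls: sample_scores[cls], reverse=True)
            hits += int(label in ranked[:k])
        results.append(hits / len(labels))
    return results


scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
labels = [1, 1, 0, 0]
print(top_k_accuracy_sketch(scores, labels, topk=(1, 2)))  # [0.5, 0.75]
```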
mmaction.localization¶
localization¶
-
mmaction.localization.eval_ap(detections, gt_by_cls, iou_range)[source]¶ Evaluate average precisions.
- Parameters
detections (dict) – Results of detections.
gt_by_cls (dict) – Information of groundtruth.
iou_range (list) – Ranges of iou.
- Returns
Average precision values of classes at ious.
- Return type
list
-
mmaction.localization.generate_bsp_feature(video_list, video_infos, tem_results_dir, pgm_proposals_dir, top_k=1000, bsp_boundary_ratio=0.2, num_sample_start=8, num_sample_end=8, num_sample_action=16, num_sample_interp=3, tem_results_ext='.csv', pgm_proposal_ext='.csv', result_dict=None)[source]¶ Generate Boundary-Sensitive Proposal Feature with given proposals.
- Parameters
video_list (list[int]) – List of video indexes to generate bsp_feature.
video_infos (list[dict]) – List of video_info dict that contains ‘video_name’.
tem_results_dir (str) – Directory to load temporal evaluation results.
pgm_proposals_dir (str) – Directory to load proposals.
top_k (int) – Number of proposals to be considered. Default: 1000
bsp_boundary_ratio (float) – Ratio for proposal boundary (start/end). Default: 0.2.
num_sample_start (int) – Num of samples for actionness in start region. Default: 8.
num_sample_end (int) – Num of samples for actionness in end region. Default: 8.
num_sample_action (int) – Num of samples for actionness in center region. Default: 16.
num_sample_interp (int) – Num of samples for interpolation for each sample point. Default: 3.
tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.
pgm_proposal_ext (str) – File extension for proposals. Default: ‘.csv’.
result_dict (dict | None) – The dict to save the results. Default: None.
- Returns
bsp_feature_dict: A dict containing video_name as keys and bsp_feature as values. If result_dict is not None, the results are saved to it.
- Return type
dict
-
mmaction.localization.generate_candidate_proposals(video_list, video_infos, tem_results_dir, temporal_scale, peak_threshold, tem_results_ext='.csv', result_dict=None)[source]¶ Generate Candidate Proposals with given temporal evaluation results. Each proposal file will contain: ‘tmin,tmax,tmin_score,tmax_score,score,match_iou,match_ioa’.
- Parameters
video_list (list[int]) – List of video indexes to generate proposals.
video_infos (list[dict]) – List of video_info dict that contains ‘video_name’, ‘duration_frame’, ‘duration_second’, ‘feature_frame’, and ‘annotations’.
tem_results_dir (str) – Directory to load temporal evaluation results.
temporal_scale (int) – The number (scale) on temporal axis.
peak_threshold (float) – The threshold for proposal generation.
tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.
result_dict (dict | None) – The dict to save the results. Default: None.
- Returns
A dict containing video_name as keys and the proposal list as values. If result_dict is not None, the results are saved to it.
- Return type
dict
-
mmaction.localization.load_localize_proposal_file(filename)[source]¶ Load the proposal file and split it into many parts which contain one video’s information separately.
- Parameters
filename (str) – Path to the proposal file.
- Returns
List of all videos’ information.
- Return type
list
-
mmaction.localization.perform_regression(detections)[source]¶ Perform regression on detection results.
- Parameters
detections (list) – Detection results before regression.
- Returns
Detection results after regression.
- Return type
list
-
mmaction.localization.soft_nms(proposals, alpha, low_threshold, high_threshold, top_k)[source]¶ Soft NMS for temporal proposals.
- Parameters
proposals (np.ndarray) – Proposals generated by network.
alpha (float) – Alpha value of Gaussian decaying function.
low_threshold (float) – Low threshold for soft nms.
high_threshold (float) – High threshold for soft nms.
top_k (int) – Top k values to be considered.
- Returns
The updated proposals.
- Return type
np.ndarray
-
mmaction.localization.temporal_iop(proposal_min, proposal_max, gt_min, gt_max)[source]¶ Compute IoP score between a groundtruth bbox and the proposals.
Compute the IoP which is defined as the overlap ratio with groundtruth proportional to the duration of this proposal.
- Parameters
proposal_min (list[float]) – List of temporal anchor min.
proposal_max (list[float]) – List of temporal anchor max.
gt_min (float) – Groundtruth temporal box min.
gt_max (float) – Groundtruth temporal box max.
- Returns
List of intersection over anchor scores.
- Return type
list[float]
-
mmaction.localization.temporal_iou(proposal_min, proposal_max, gt_min, gt_max)[source]¶ Compute IoU score between a groundtruth bbox and the proposals.
- Parameters
proposal_min (list[float]) – List of temporal anchor min.
proposal_max (list[float]) – List of temporal anchor max.
gt_min (float) – Groundtruth temporal box min.
gt_max (float) – Groundtruth temporal box max.
- Returns
List of iou scores.
- Return type
list[float]
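temporal_iop and temporal_iou share the same intersection and differ only in the denominator: proposal duration for IoP, union duration for IoU. The sketch below computes both side by side in pure Python; the function name is hypothetical, and proposals are assumed to have non-zero length.

```python
def temporal_overlaps_sketch(proposal_min, proposal_max, gt_min, gt_max):
    """Return (iou_list, iop_list) of each proposal against one GT box."""
    ious, iops = [], []
    for p_min, p_max in zip(proposal_min, proposal_max):
        proposal_len = p_max - p_min  # assumed > 0
        intersection = max(0.0, min(p_max, gt_max) - max(p_min, gt_min))
        union = proposal_len + (gt_max - gt_min) - intersection
        ious.append(intersection / union)
        iops.append(intersection / proposal_len)
    return ious, iops


ious, iops = temporal_overlaps_sketch([0.0, 2.0], [4.0, 3.0], 2.0, 6.0)
print(ious)
print(iops)  # [0.5, 1.0]
```

The second proposal lies entirely inside the groundtruth box, so its IoP is 1.0 while its IoU is only 0.25.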
mmaction.models¶
models¶
-
class
mmaction.models.AudioRecognizer(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶ Audio recognizer model framework.
-
forward(audios, label=None, return_loss=True)[source]¶ Define the computation performed at every call.
-
forward_gradcam(audios)[source]¶ Defines the computation performed at every call when using gradcam utils.
-
forward_test(audios)[source]¶ Defines the computation performed at every call when evaluation and testing.
-
forward_train(audios, labels)[source]¶ Defines the computation performed at every call when training.
-
train_step(data_batch, optimizer, **kwargs)[source]¶ The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
-
val_step(data_batch, optimizer, **kwargs)[source]¶ The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
-
-
class
mmaction.models.AudioTSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶ Classification head for TSN on audio.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
-
class
mmaction.models.BBoxHeadAVA(temporal_pool_type='avg', spatial_pool_type='max', in_channels=2048, num_classes=81, dropout_ratio=0, dropout_before_pool=True, topk=(3, 5), multilabel=True)[source]¶ Simplest RoI head, with only two fc layers for classification and regression respectively.
- Parameters
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
in_channels (int) – The number of input channels. Default: 2048.
num_classes (int) – The number of classes. Default: 81.
dropout_ratio (float) – A float in [0, 1], indicates the dropout_ratio. Default: 0.
dropout_before_pool (bool) – Dropout Feature before spatial temporal pooling. Default: True.
topk (int or tuple[int]) – Parameter for evaluating multilabel accuracy. Default: (3, 5)
multilabel (bool) – Whether used for a multilabel task. Default: True. (Only support multilabel == True now).
-
forward(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
mmaction.models.BCELossWithLogits(loss_weight=1.0, class_weight=None)[source]¶ Binary Cross Entropy Loss with logits.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
-
class
mmaction.models.BMN(temporal_dim, boundary_ratio, num_samples, num_samples_per_bin, feat_dim, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, loss_cls={'type': 'BMNLoss'}, hidden_dim_1d=256, hidden_dim_2d=128, hidden_dim_3d=512)[source]¶ Boundary Matching Network for temporal action proposal generation.
Please refer to BMN: Boundary-Matching Network for Temporal Action Proposal Generation. Code reference: https://github.com/JJBOY/BMN-Boundary-Matching-Network
- Parameters
temporal_dim (int) – Total frames selected for each video.
boundary_ratio (float) – Ratio for determining video boundaries.
num_samples (int) – Number of samples for each proposal.
num_samples_per_bin (int) – Number of bin samples for each sample.
feat_dim (int) – Feature dimension.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
loss_cls (dict) – Config for building loss. Default: dict(type='BMNLoss').
hidden_dim_1d (int) – Hidden dim for 1d conv. Default: 256.
hidden_dim_2d (int) – Hidden dim for 2d conv. Default: 128.
hidden_dim_3d (int) – Hidden dim for 3d conv. Default: 512.
-
forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶ Define the computation performed at every call.
-
forward_test(raw_feature, video_meta)[source]¶ Define the computation performed at every call when testing.
-
class
mmaction.models.BMNLoss[source]¶ BMN Loss.
From paper https://arxiv.org/abs/1907.09702, code https://github.com/JJBOY/BMN-Boundary-Matching-Network. It will calculate loss for BMN Model. This loss is a weighted sum of
1) temporal evaluation loss based on confidence score of start and end positions. 2) proposal evaluation regression loss based on confidence scores of candidate proposals. 3) proposal evaluation classification loss based on classification results of candidate proposals.
-
forward(pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, bm_mask, weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0)[source]¶ Calculate Boundary Matching Network Loss.
- Parameters
pred_bm (torch.Tensor) – Predicted confidence score for boundary matching map.
pred_start (torch.Tensor) – Predicted confidence score for start.
pred_end (torch.Tensor) – Predicted confidence score for end.
gt_iou_map (torch.Tensor) – Groundtruth score for boundary matching map.
gt_start (torch.Tensor) – Groundtruth temporal_iou score for start.
gt_end (torch.Tensor) – Groundtruth temporal_iou score for end.
bm_mask (torch.Tensor) – Boundary-Matching mask.
weight_tem (float) – Weight for tem loss. Default: 1.0.
weight_pem_reg (float) – Weight for pem regression loss. Default: 10.0.
weight_pem_cls (float) – Weight for pem classification loss. Default: 1.0.
- Returns
(loss, tem_loss, pem_reg_loss, pem_cls_loss). Loss is the bmn loss, tem_loss is the temporal evaluation loss, pem_reg_loss is the proposal evaluation regression loss, pem_cls_loss is the proposal evaluation classification loss.
- Return type
tuple([torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor])
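The weighted sum described for BMNLoss can be written out directly. The sketch below uses plain floats in place of tensors and the default weights from the forward signature; the function name is hypothetical and the three component values are made up.

```python
def combine_bmn_losses(tem_loss, pem_reg_loss, pem_cls_loss,
                       weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0):
    """Weighted sum of the three BMN loss terms (floats stand in for tensors)."""
    return (weight_tem * tem_loss
            + weight_pem_reg * pem_reg_loss
            + weight_pem_cls * pem_cls_loss)


# Illustrative component values only.
print(combine_bmn_losses(0.5, 0.25, 1.0))  # 0.5 + 10 * 0.25 + 1.0 = 4.0
```

With the default weights, the regression term contributes ten times its raw value to the total.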
-
static
pem_cls_loss(pred_score, gt_iou_map, mask, threshold=0.9, ratio_range=(1.05, 21), eps=1e-05)[source]¶ Calculate Proposal Evaluation Module Classification Loss.
- Parameters
pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
threshold (float) – Threshold of temporal_iou for positive instances. Default: 0.9.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)
eps (float) – Epsilon for small value. Default: 1e-5
- Returns
Proposal evaluation classification loss.
- Return type
torch.Tensor
-
static
pem_reg_loss(pred_score, gt_iou_map, mask, high_temporal_iou_threshold=0.7, low_temporal_iou_threshold=0.3)[source]¶ Calculate Proposal Evaluation Module Regression Loss.
- Parameters
pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
high_temporal_iou_threshold (float) – Higher threshold of temporal_iou. Default: 0.7.
low_temporal_iou_threshold (float) – Lower threshold of temporal_iou. Default: 0.3.
- Returns
Proposal evaluation regression loss.
- Return type
torch.Tensor
-
static
tem_loss(pred_start, pred_end, gt_start, gt_end)[source]¶ Calculate Temporal Evaluation Module Loss.
This function calculates the binary_logistic_regression_loss for start and end respectively and returns the sum of their losses.
- Parameters
pred_start (torch.Tensor) – Predicted start score by BMN model.
pred_end (torch.Tensor) – Predicted end score by BMN model.
gt_start (torch.Tensor) – Groundtruth confidence score for start.
gt_end (torch.Tensor) – Groundtruth confidence score for end.
- Returns
Returned binary logistic loss.
- Return type
torch.Tensor
-
-
class
mmaction.models.BaseHead(num_classes, in_channels, loss_cls={'loss_weight': 1.0, 'type': 'CrossEntropyLoss'}, multi_class=False, label_smooth_eps=0.0)[source]¶ Base class for head.
All heads should subclass it. All subclasses should overwrite:
Methods: init_weights, initializing weights in some modules.
Methods: forward, supporting forward in both training and testing.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’, loss_weight=1.0).
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: arxiv.org/abs/1906.02629. Default: 0.
-
abstract
init_weights()[source]¶ Initialize the parameters either from an existing checkpoint or from scratch.
-
loss(cls_score, labels, **kwargs)[source]¶ Calculate the loss given output cls_score, target labels.
- Parameters
cls_score (torch.Tensor) – The output of the model.
labels (torch.Tensor) – The target output of the model.
- Returns
A dict containing field ‘loss_cls’(mandatory) and ‘top1_acc’, ‘top5_acc’(optional).
- Return type
dict
-
class
mmaction.models.BaseRecognizer(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶ Base class for recognizers.
All recognizers should subclass it. All subclasses should overwrite:
Methods: forward_train, supporting forward when training.
Methods: forward_test, supporting forward when testing.
- Parameters
backbone (dict) – Backbone modules to extract feature.
cls_head (dict) – Classification head to process feature.
train_cfg (dict | None) – Config for training. Default: None.
test_cfg (dict | None) – Config for testing. Default: None.
-
average_clip(cls_score, num_segs=1)[source]¶ Averaging class score over multiple clips.
Uses different averaging types (‘score’, ‘prob’ or None, as defined in test_cfg) to compute the final averaged class score. Only called in test mode.
- Parameters
cls_score (torch.Tensor) – Class score to be averaged.
num_segs (int) – Number of clips for each input sample.
- Returns
Averaged class score.
- Return type
torch.Tensor
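To show how the ‘score’ and ‘prob’ averaging types differ, here is a pure-Python sketch that groups every num_segs consecutive clip scores into one sample and averages them, applying softmax first in the ‘prob’ case. Nested lists stand in for tensors, the function names are hypothetical, and passing the averaging type as an argument named average_clips is an assumption of this sketch (the text above only says it is defined in test_cfg).

```python
import math


def _softmax(row):
    shift = max(row)
    exps = [math.exp(value - shift) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def average_clip_sketch(cls_score, num_segs=1, average_clips="score"):
    """Average clip-level scores into sample-level scores.

    cls_score: [batch_size * num_segs][num_classes] nested list.
    """
    if average_clips is None:
        return cls_score  # no averaging
    if average_clips not in ("score", "prob"):
        raise ValueError("average_clips must be 'score', 'prob' or None")

    averaged = []
    for start in range(0, len(cls_score), num_segs):
        clips = cls_score[start:start + num_segs]
        if average_clips == "prob":
            # Convert each clip's scores to probabilities before averaging.
            clips = [_softmax(clip) for clip in clips]
        num_classes = len(clips[0])
        averaged.append([
            sum(clip[k] for clip in clips) / len(clips)
            for k in range(num_classes)
        ])
    return averaged


clip_scores = [[1.0, 3.0], [3.0, 1.0]]  # one sample, two clips
print(average_clip_sketch(clip_scores, num_segs=2, average_clips="score"))
# [[2.0, 2.0]]
```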
-
extract_feat(imgs)[source]¶ Extract features through a backbone.
- Parameters
imgs (torch.Tensor) – The input images.
- Returns
The extracted features.
- Return type
torch.Tensor
-
forward(imgs, label=None, return_loss=True, **kwargs)[source]¶ Define the computation performed at every call.
-
abstract
forward_gradcam(imgs)[source]¶ Defines the computation performed at every call when using gradcam utils.
-
abstract
forward_test(imgs)[source]¶ Defines the computation performed at every call during evaluation and testing.
-
abstract
forward_train(imgs, labels, **kwargs)[source]¶ Defines the computation performed at every call when training.
-
train_step(data_batch, optimizer, **kwargs)[source]¶ The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of the runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: loss, log_vars, num_samples.
loss is a tensor for back propagation, which can be a weighted sum of multiple losses.
log_vars contains all the variables to be sent to the logger.
num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
-
val_step(data_batch, optimizer, **kwargs)[source]¶ The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
-
property
with_neck¶ Whether the recognizer has a neck.
- Type
bool
-
class
mmaction.models.BinaryLogisticRegressionLoss[source]¶ Binary Logistic Regression Loss.
It will calculate binary logistic regression loss given reg_score and label.
-
forward(reg_score, label, threshold=0.5, ratio_range=(1.05, 21), eps=1e-05)[source]¶ Calculate Binary Logistic Regression Loss.
- Parameters
reg_score (torch.Tensor) – Predicted score by model.
label (torch.Tensor) – Groundtruth labels.
threshold (float) – Threshold for positive instances. Default: 0.5.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)
eps (float) – Epsilon for small value. Default: 1e-5.
- Returns
Returned binary logistic loss.
- Return type
torch.Tensor
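A minimal sketch of how threshold, ratio_range and eps can interact, assuming the class-balanced weighting commonly used in BSN-style implementations: positives and negatives are re-weighted by the clamped ratio of all entries to positive entries. This is an illustration in plain Python and may differ from the actual implementation.

```python
import math


def binary_logistic_regression_loss(reg_score, label, threshold=0.5,
                                    ratio_range=(1.05, 21), eps=1e-5):
    """reg_score and label are flat lists of floats in [0, 1]."""
    # Entries whose label exceeds the threshold count as positives.
    pmask = [1.0 if y > threshold else 0.0 for y in label]
    num_positive = max(sum(pmask), 1.0)
    num_entries = len(label)
    # Ratio of all entries to positives, clamped to ratio_range.
    ratio = num_entries / num_positive
    ratio = min(max(ratio, ratio_range[0]), ratio_range[1])
    coef_pos = 0.5 * ratio
    coef_neg = 0.5 * ratio / (ratio - 1.0)
    total = 0.0
    for score, m in zip(reg_score, pmask):
        total += coef_pos * m * math.log(score + eps)
        total += coef_neg * (1.0 - m) * math.log(1.0 - score + eps)
    return -total / num_entries


labels = [1.0, 0.0, 0.0, 0.0]
good = binary_logistic_regression_loss([0.9, 0.1, 0.1, 0.1], labels)
bad = binary_logistic_regression_loss([0.1, 0.9, 0.9, 0.9], labels)
print(good, bad)
```

The lower bound of ratio_range stays above 1, which keeps the negative coefficient finite when almost every entry is positive.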
-
class
mmaction.models.C3D(pretrained=None, style='pytorch', conv_cfg=None, norm_cfg=None, act_cfg=None, dropout_ratio=0.5, init_std=0.005)[source]¶ C3D backbone.
- Parameters
pretrained (str | None) – Name of pretrained model.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
conv_cfg (dict | None) – Config dict for convolution layer. If set to None, it uses dict(type='Conv3d') to construct layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. The required key is type. Default: None.
act_cfg (dict | None) – Config dict for activation layer. If set to None, it uses dict(type='ReLU') to construct layers. Default: None.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization of fc layers. Default: 0.005.
-
class
mmaction.models.Conv2plus1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, norm_cfg={'type': 'BN3d'})[source]¶ (2+1)d Conv module for R(2+1)d backbone.
https://arxiv.org/pdf/1711.11248.pdf.
- Parameters
in_channels (int) – Same as nn.Conv3d.
out_channels (int) – Same as nn.Conv3d.
kernel_size (int | tuple[int]) – Same as nn.Conv3d.
stride (int | tuple[int]) – Same as nn.Conv3d.
padding (int | tuple[int]) – Same as nn.Conv3d.
dilation (int | tuple[int]) – Same as nn.Conv3d.
groups (int) – Same as nn.Conv3d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.
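A minimal sketch of the (2+1)D factorization idea, assuming the intermediate channel count from the referenced paper, which is chosen so that the factorized pair (a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv) has roughly as many weights as the full t x d x d conv. Both helper names are hypothetical, and biases and norm layers are ignored.

```python
def conv2plus1d_mid_channels(in_channels, out_channels, kernel_size=(3, 3, 3)):
    """Intermediate channels M = floor(t*d*d*Nin*Nout / (d*d*Nin + t*Nout))."""
    t, h, w = kernel_size
    full = t * h * w * in_channels * out_channels
    return full // (h * w * in_channels + t * out_channels)


def weight_counts(in_channels, out_channels, kernel_size=(3, 3, 3)):
    """Return (full 3D conv weights, factorized (2+1)D conv weights)."""
    t, h, w = kernel_size
    mid = conv2plus1d_mid_channels(in_channels, out_channels, kernel_size)
    full_3d = t * h * w * in_channels * out_channels
    spatial = h * w * in_channels * mid   # 1 x h x w convolution
    temporal = t * mid * out_channels     # t x 1 x 1 convolution
    return full_3d, spatial + temporal


print(conv2plus1d_mid_channels(64, 64))
print(weight_counts(64, 64))
```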
-
class
mmaction.models.ConvAudio(in_channels, out_channels, kernel_size, op='concat', stride=1, padding=0, dilation=1, groups=1, bias=False)[source]¶ Conv2d module for AudioResNet backbone.
- Parameters
in_channels (int) – Same as nn.Conv2d.
out_channels (int) – Same as nn.Conv2d.
kernel_size (int | tuple[int]) – Same as nn.Conv2d.
op (string) – Operation to merge the output of freq and time feature map. Choices are ‘sum’ and ‘concat’. Default: ‘concat’.
stride (int | tuple[int]) – Same as nn.Conv2d.
padding (int | tuple[int]) – Same as nn.Conv2d.
dilation (int | tuple[int]) – Same as nn.Conv2d.
groups (int) – Same as nn.Conv2d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.
-
class
mmaction.models.CrossEntropyLoss(loss_weight=1.0, class_weight=None)[source]¶ Cross Entropy Loss.
Supports two kinds of labels and their corresponding loss types. The loss type is detected from the shapes of cls_score and label.
1) Hard label: This label is an integer array and all of the elements are in the range [0, num_classes - 1]. This label’s shape should be cls_score’s shape with the num_classes dimension removed.
2) Soft label (probability distribution over classes): This label is a probability distribution and all of the elements are in the range [0, 1]. This label’s shape must be the same as cls_score. For now, only 2-dim soft labels are supported.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
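A minimal sketch of the two label kinds in plain Python. Here the check on the element type stands in for the shape-based detection described above: an integer target is treated as a hard label and a row of probabilities as a soft label. loss_weight and class_weight are left out for brevity.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of class scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def cross_entropy(cls_score, label):
    """cls_score: N rows of scores. label: N ints (hard) or N rows (soft)."""
    losses = []
    for scores, target in zip(cls_score, label):
        log_probs = log_softmax(scores)
        if isinstance(target, int):
            # Hard label: negative log-probability of the target class.
            losses.append(-log_probs[target])
        else:
            # Soft label: expected negative log-probability under the target.
            losses.append(-sum(p * lp for p, lp in zip(target, log_probs)))
    return sum(losses) / len(losses)


cls_score = [[1.0, 2.0, 0.5]]
hard_loss = cross_entropy(cls_score, [1])
soft_loss = cross_entropy(cls_score, [[0.0, 1.0, 0.0]])
print(hard_loss, soft_loss)
```

A one-hot soft label gives the same value as the corresponding hard label, which is a quick sanity check for either path.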
-
class
mmaction.models.FBOHead(lfb_cfg, fbo_cfg, temporal_pool_type='avg', spatial_pool_type='max')[source]¶ Feature Bank Operator Head.
Add feature bank operator for the spatiotemporal detection model to fuse short-term features and long-term features.
- Parameters
lfb_cfg (Dict) – The config dict for LFB which is used to sample long-term features.
fbo_cfg (Dict) – The config dict for feature bank operator (FBO). The type of fbo is also in the config dict and supported fbo type is fbo_dict.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
-
forward(x, rois, img_metas)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
mmaction.models.HVULoss(categories=('action', 'attribute', 'concept', 'event', 'object', 'scene'), category_nums=(739, 117, 291, 69, 1678, 248), category_loss_weights=(1, 1, 1, 1, 1, 1), loss_type='all', with_mask=False, reduction='mean', loss_weight=1.0)[source]¶ Calculate the BCELoss for HVU.
- Parameters
categories (tuple[str]) – Names of tag categories, tags are organized in this order. Default: (‘action’, ‘attribute’, ‘concept’, ‘event’, ‘object’, ‘scene’).
category_nums (tuple[int]) – Number of tags for each category. Default: (739, 117, 291, 69, 1678, 248).
category_loss_weights (tuple[float]) – Loss weights of categories. Applies only if loss_type == ‘individual’. The weights are normalized so that they sum to 1, so any positive numbers can be given. Default: (1, 1, 1, 1, 1, 1).
loss_type (str) – The loss type to calculate: either the BCELoss over all tags, or the BCELoss over the tags of each category. Choices are ‘individual’ or ‘all’. Default: ‘all’.
with_mask (bool) – Some tag categories are missing for some video clips. If with_mask == True, no loss is calculated for these missing categories. Otherwise, the missing categories are treated as negative samples. Default: False.
reduction (str) – Reduction way. Choices are ‘mean’ or ‘sum’. Default: ‘mean’.
loss_weight (float) – The loss weight. Default: 1.0.
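A minimal sketch of the bookkeeping implied by these parameters, assuming that tags are laid out category by category in the order of categories, so that category_nums determines which slice of the score vector belongs to each category. The two helper names are hypothetical.

```python
categories = ("action", "attribute", "concept", "event", "object", "scene")
category_nums = (739, 117, 291, 69, 1678, 248)
category_loss_weights = (1, 1, 1, 1, 1, 1)


def category_slices(nums):
    """Return (start, end) index pairs for consecutive categories."""
    slices, start = [], 0
    for n in nums:
        slices.append((start, start + n))
        start += n
    return slices


def normalize_weights(weights):
    """Scale positive weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


slices = dict(zip(categories, category_slices(category_nums)))
weights = normalize_weights(category_loss_weights)
print(slices)
print(weights)
```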
-
class
mmaction.models.I3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, **kwargs)[source]¶ Classification head for I3D.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
-
class
mmaction.models.LFB(lfb_prefix_path, max_num_sampled_feat=5, window_size=60, lfb_channels=2048, dataset_modes=('train', 'val'), device='gpu', lmdb_map_size=4000000000.0, construct_lmdb=True)[source]¶ Long-Term Feature Bank (LFB).
LFB is proposed in Long-Term Feature Banks for Detailed Video Understanding
The ROI features of videos are stored in the feature bank. The feature bank was generated by inferring with a lfb infer config.
Formally, LFB is a Dict whose keys are video IDs and its values are also Dicts whose keys are timestamps in seconds. Example of LFB:
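A minimal sketch of that nested structure in plain Python. The video IDs, timestamps and feature values are hypothetical placeholders, and the feature width is shrunk to 4 instead of lfb_channels so that the output stays readable. In practice the values would be tensors of ROI features.

```python
import random


def fake_roi_features(num_rois, channels=4):
    """Placeholder ROI features: one row of `channels` floats per ROI."""
    return [[random.gauss(0.0, 1.0) for _ in range(channels)]
            for _ in range(num_rois)]


# Outer keys: video IDs. Inner keys: timestamps in seconds.
lfb = {
    "video_a": {
        901: fake_roi_features(4),
        902: fake_roi_features(2),
    },
    "video_b": {
        15: fake_roi_features(3),
    },
}

feats = lfb["video_a"][901]
print(len(feats), len(feats[0]))
```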
- Parameters
lfb_prefix_path (str) – The storage path of lfb.
max_num_sampled_feat (int) – The max number of sampled features. Default: 5.
window_size (int) – Window size of sampling long term feature. Default: 60.
lfb_channels (int) – Number of the channels of the features stored in LFB. Default: 2048.
dataset_modes (tuple[str] | str) – Load LFB of datasets with different modes, such as training, validation, testing datasets. If you don’t do cross validation during training, just load the training dataset i.e. setting dataset_modes = (‘train’). Default: (‘train’, ‘val’).
device (str) – Where to load lfb. Choices are ‘gpu’, ‘cpu’ and ‘lmdb’. A 1.65GB half-precision ava lfb (including training and validation) occupies about 2GB GPU memory. Default: ‘gpu’.
lmdb_map_size (int) – Map size of lmdb. Default: 4e9.
construct_lmdb (bool) – Whether to construct lmdb. If you have constructed lmdb of lfb, you can set to False to skip the construction. Default: True.
-
class
mmaction.models.LFBInferHead(lfb_prefix_path, dataset_mode='train', use_half_precision=True, temporal_pool_type='avg', spatial_pool_type='max')[source]¶ Long-Term Feature Bank Infer Head.
This head is used to derive and save the LFB without affecting the input.
- Parameters
lfb_prefix_path (str) – The prefix path to store the lfb.
dataset_mode (str, optional) – Which dataset to be inferred. Choices are ‘train’, ‘val’ or ‘test’. Default: ‘train’.
use_half_precision (bool, optional) – Whether to store the half-precision roi features. Default: True.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
-
forward(x, rois, img_metas)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
mmaction.models.MobileNetV2(pretrained=None, widen_factor=1.0, out_indices=(7,), frozen_stages=-1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU6'}, norm_eval=False, with_cp=False)[source]¶ MobileNetV2 backbone.
- Parameters
pretrained (str | None) – Name of pretrained model. Default: None.
widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Default: 1.0.
out_indices (None or Sequence[int]) – Output from which stages. Default: (7, ).
frozen_stages (int) – Stages to be frozen (all param fixed). Default: -1, which means not freezing any parameters.
conv_cfg (dict) – Config dict for convolution layer. Default: dict(type=’Conv’).
norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type=’ReLU6’, inplace=True).
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
-
forward(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
make_layer(out_channels, num_blocks, stride, expand_ratio)[source]¶ Stack InvertedResidual blocks to build a layer for MobileNetV2.
- Parameters
out_channels (int) – out_channels of block.
num_blocks (int) – number of blocks.
stride (int) – stride of the first block.
expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio.
-
train(mode=True)[source]¶ Sets the module in training mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behavior in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
- Parameters
mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns
self
- Return type
Module
-
class
mmaction.models.MobileNetV2TSM(num_segments=8, is_shift=True, shift_div=8, **kwargs)[source]¶ MobileNetV2 backbone for TSM.
- Parameters
num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to make temporal shift in reset layers. Default: True.
shift_div (int) – Number of div for shift. Default: 8.
**kwargs (keyword arguments, optional) – Arguments for MobileNetV2.
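A minimal sketch of the temporal shift that shift_div controls, assuming the usual TSM scheme: one 1/shift_div share of the channels is pulled from the next frame, a second share of the same size from the previous frame, and the remaining channels stay in place. It works on a list of frames in plain Python, with zero padding at the clip boundaries, and is not the backbone's actual code.

```python
def temporal_shift(x, shift_div=8):
    """x is a list of T frames, each a list of C channel values."""
    num_frames, num_channels = len(x), len(x[0])
    fold = num_channels // shift_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # first share: taken from the next frame
            elif c < 2 * fold:
                src = t - 1      # second share: taken from the previous frame
            else:
                src = t          # remaining channels are not shifted
            if 0 <= src < num_frames:
                out[t][c] = x[src][c]
    return out


# 3 frames with 8 channels each; value 10 * t + c identifies frame and channel.
frames = [[float(10 * t + c) for c in range(8)] for t in range(3)]
shifted = temporal_shift(frames, shift_div=8)
for row in shifted:
    print(row)
```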
-
class
mmaction.models.NLLLoss(loss_weight=1.0)[source]¶ NLL Loss.
It will calculate NLL loss given cls_score and label.
-
class
mmaction.models.OHEMHingeLoss[source]¶ This class is the core implementation for the completeness loss in paper.
It computes class-wise hinge loss and performs online hard example mining (OHEM).
-
static
backward(ctx, grad_output)[source]¶ Defines a formula for differentiating the operation.
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by as many outputs as forward() returned, and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input.
The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.
-
static
forward(ctx, pred, labels, is_positive, ohem_ratio, group_size)[source]¶ Calculate OHEM hinge loss.
- Parameters
pred (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
is_positive (int) – Set to 1 when proposals are positive and set to -1 when proposals are incomplete.
ohem_ratio (float) – Ratio of hard examples.
group_size (int) – Number of proposals sampled per video.
- Returns
Returned class-wise hinge loss.
- Return type
torch.Tensor
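A minimal sketch of the forward computation in plain Python, under two assumptions that are not stated above: class labels are 1-based (index 0 of a score row belongs to class 1), and within each group of group_size proposals only the int(group_size * ohem_ratio) largest hinge losses are summed. The gradient handling of the autograd function is not covered.

```python
def ohem_hinge_loss(pred, labels, is_positive, ohem_ratio, group_size):
    """pred: N rows of per-class scores. labels: N class ids (assumed 1-based)."""
    # Class-wise hinge loss on the score of the ground-truth class.
    losses = [
        max(0.0, 1.0 - is_positive * row[label - 1])
        for row, label in zip(pred, labels)
    ]
    keep = int(group_size * ohem_ratio)
    total = 0.0
    # Online hard example mining: keep the hardest proposals of each group.
    for start in range(0, len(losses), group_size):
        group = sorted(losses[start:start + group_size], reverse=True)
        total += sum(group[:keep])
    return total


pred = [[2.0], [0.5], [-1.0], [0.0]]
labels = [1, 1, 1, 1]
print(ohem_hinge_loss(pred, labels, is_positive=1, ohem_ratio=0.5, group_size=4))
```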
-
class
mmaction.models.PEM(pem_feat_dim, pem_hidden_dim, pem_u_ratio_m, pem_u_ratio_l, pem_high_temporal_iou_threshold, pem_low_temporal_iou_threshold, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, fc1_ratio=0.1, fc2_ratio=0.1, output_dim=1)[source]¶ Proposals Evaluation Model for Boundary Sensitive Network.
Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.
Code reference https://github.com/wzmsltw/BSN-boundary-sensitive-network
- Parameters
pem_feat_dim (int) – Feature dimension.
pem_hidden_dim (int) – Hidden layer dimension.
pem_u_ratio_m (float) – Ratio for medium score proposals to balance data.
pem_u_ratio_l (float) – Ratio for low score proposals to balance data.
pem_high_temporal_iou_threshold (float) – High IoU threshold.
pem_low_temporal_iou_threshold (float) – Low IoU threshold.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
fc1_ratio (float) – Ratio for fc1 layer output. Default: 0.1.
fc2_ratio (float) – Ratio for fc2 layer output. Default: 0.1.
output_dim (int) – Output dimension. Default: 1.
-
forward(bsp_feature, reference_temporal_iou=None, tmin=None, tmax=None, tmin_score=None, tmax_score=None, video_meta=None, return_loss=True)[source]¶ Define the computation performed at every call.
-
class
mmaction.models.ResNet(depth, pretrained=None, torchvision_pretrain=True, in_channels=3, num_stages=4, out_indices=(3,), strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), style='pytorch', frozen_stages=-1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, partial_bn=False, with_cp=False)[source]¶ ResNet backbone.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of the first block of each stage.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
dilations (Sequence[int]) – Dilation of each stage.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
partial_bn (bool) – Whether to use partial bn. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
-
class
mmaction.models.ResNet2Plus1d(*args, **kwargs)[source]¶ ResNet (2+1)d backbone.
This model is proposed in A Closer Look at Spatiotemporal Convolutions for Action Recognition
-
class
mmaction.models.ResNet3d(depth, pretrained, pretrained2d=True, in_channels=3, num_stages=4, base_channels=64, out_indices=(3,), spatial_strides=(1, 2, 2, 2), temporal_strides=(1, 1, 1, 1), dilations=(1, 1, 1, 1), conv1_kernel=(5, 7, 7), conv1_stride_t=2, pool1_stride_t=2, with_pool2=True, style='pytorch', frozen_stages=-1, inflate=(1, 1, 1, 1), inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, non_local=(0, 0, 0, 0), non_local_cfg={}, zero_init_residual=True, **kwargs)[source]¶ ResNet 3d backbone.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
in_channels (int) – Channel num of input features. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (1, 2, 2, 2).
temporal_strides (Sequence[int]) – Temporal strides of residual blocks of each stage. Default: (1, 1, 1, 1).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (5, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 2.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 2.
with_pool2 (bool) – Whether to use pool2. Default: True.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
inflate (Sequence[int]) – Inflate Dims of each block. Default: (1, 1, 1, 1).
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
conv_cfg (dict) – Config for conv layers. The required key is type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stage. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
-
forward(x)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
-
static
make_res_layer(block, inplanes, planes, blocks, spatial_stride=1, temporal_stride=1, dilation=1, style='pytorch', inflate=1, inflate_style='3x1x1', non_local=0, non_local_cfg={}, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶ Build residual layer for ResNet3D.
- Parameters
block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
spatial_stride (int | Sequence[int]) – Spatial strides in residual and conv layers. Default: 1.
temporal_stride (int | Sequence[int]) – Temporal strides in residual and conv layers. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
style (str) – pytorch or caffe. If set to pytorch, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.
inflate (int | Sequence[int]) – Determine whether to inflate for each block. Default: 1.
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
non_local (int | Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stage. Default: 0.
non_local_cfg (dict) – Config for non-local module. Default: dict().
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activate layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
- Return type
nn.Module
-
class
mmaction.models.ResNet3dCSN(depth, pretrained, temporal_strides=(1, 2, 2, 2), conv1_kernel=(3, 7, 7), conv1_stride_t=1, pool1_stride_t=1, norm_cfg={'eps': 0.001, 'requires_grad': True, 'type': 'BN3d'}, inflate_style='3x3x3', bottleneck_mode='ir', bn_frozen=False, **kwargs)[source]¶ ResNet backbone for CSN.
- Parameters
depth (int) – Depth of ResNetCSN, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
temporal_strides (tuple[int]) – Temporal strides of residual blocks of each stage. Default: (1, 2, 2, 2).
conv1_kernel (tuple[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN3d’, requires_grad=True, eps=1e-3).
inflate_style (str) – 3x1x1 or 1x1x1. which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x3x3’.
bottleneck_mode (str) –
Determines how to factorize a 3D bottleneck block using channel-separated convolutional networks.
If set to ‘ip’, it will replace the 3x3x3 conv2 layer with a 1x1x1 traditional convolution and a 3x3x3 depthwise convolution, i.e., Interaction-preserved channel-separated bottleneck block. If set to ‘ir’, it will replace the 3x3x3 conv2 layer with a 3x3x3 depthwise convolution, which is derived from the interaction-preserved bottleneck block by removing the extra 1x1x1 convolution, i.e., Interaction-reduced channel-separated bottleneck block.
Default: ‘ir’.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
-
class
mmaction.models.ResNet3dLayer(depth, pretrained, pretrained2d=True, stage=3, base_channels=64, spatial_stride=2, temporal_stride=1, dilation=1, style='pytorch', all_frozen=False, inflate=1, inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶ ResNet 3d Layer.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
stage (int) – The index of Resnet stage. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
spatial_stride (int) – The 1st res block’s spatial stride. Default: 2.
temporal_stride (int) – The 1st res block’s temporal stride. Default: 1.
dilation (int) – The dilation. Default: 1.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
all_frozen (bool) – Freeze all modules in the layer. Default: False.
inflate (int) – Inflate Dims of each block. Default: 1.
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
conv_cfg (dict) – Config for conv layers. The required key is type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
-
class
mmaction.models.ResNet3dSlowFast(pretrained, resample_rate=8, speed_ratio=8, channel_ratio=8, slow_pathway={'conv1_kernel': (1, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'dilations': (1, 1, 1, 1), 'inflate': (0, 0, 1, 1), 'lateral': True, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, fast_pathway={'base_channels': 8, 'conv1_kernel': (5, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'lateral': False, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'})[source]¶ Slowfast backbone.
This module is proposed in SlowFast Networks for Video Recognition
- Parameters
pretrained (str) – The file path to a pretrained model.
resample_rate (int) – A large temporal stride resample_rate on input frames. The actual resample rate is calculated by multiplying the interval in SampleFrames in the pipeline with resample_rate, equivalent to the \(\tau\) in the paper, i.e. it processes only one out of resample_rate * interval frames. Default: 8.
speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Default: 8.
channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Default: 8.
slow_pathway (dict) – Configuration of slow branch, should contain necessary arguments for building the specific type of pathway and:
type (str): type of backbone the pathway bases on.
lateral (bool): determine whether to build lateral connection for the pathway.
Default: dict(type='resnet3d', lateral=True, depth=50, pretrained=None, conv1_kernel=(1, 7, 7), dilations=(1, 1, 1, 1), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1))
fast_pathway (dict) – Configuration of fast branch, similar to slow_pathway. Default: dict(type='resnet3d', lateral=False, depth=50, pretrained=None, base_channels=8, conv1_kernel=(5, 7, 7), conv1_stride_t=1, pool1_stride_t=1)
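A minimal arithmetic sketch of how resample_rate, speed_ratio and channel_ratio relate the two pathways. It assumes plain strided frame selection (the backbone itself may resample differently) and a slow pathway with 64 stem channels. The helper name and the slow_base_channels argument are hypothetical.

```python
def slowfast_shapes(num_frames, resample_rate=8, speed_ratio=8,
                    channel_ratio=8, slow_base_channels=64):
    """Frame counts and fast-pathway stem width for a clip of num_frames frames."""
    if resample_rate < speed_ratio:
        raise ValueError("resample_rate must be at least speed_ratio")
    # Slow pathway: keep one frame out of every resample_rate frames.
    slow_frames = len(range(0, num_frames, resample_rate))
    # Fast pathway: speed_ratio times denser in time than the slow pathway.
    fast_stride = resample_rate // speed_ratio
    fast_frames = len(range(0, num_frames, fast_stride))
    return {
        "slow_frames": slow_frames,
        "fast_frames": fast_frames,
        "fast_base_channels": slow_base_channels // channel_ratio,
    }


print(slowfast_shapes(32))
```

With the documented defaults and a 32-frame clip, the fast pathway sees 8 times as many frames as the slow pathway and uses 8 stem channels, which is consistent with base_channels=8 in the default fast_pathway config.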
-
class
mmaction.models.ResNet3dSlowOnly(*args, lateral=False, conv1_kernel=(1, 7, 7), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1), with_pool2=False, **kwargs)[source]¶ SlowOnly backbone based on ResNet3dPathway.
- Parameters
*args (arguments) – Arguments same as ResNet3dPathway.
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (1, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
inflate (Sequence[int]) – Inflate Dims of each block. Default: (0, 0, 1, 1).
**kwargs (keyword arguments) – Keyword arguments for ResNet3dPathway.
-
class
mmaction.models.ResNetAudio(depth, pretrained, in_channels=1, num_stages=4, base_channels=32, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), conv1_kernel=9, conv1_stride=1, frozen_stages=-1, factorize=(1, 1, 0, 0), norm_eval=False, with_cp=False, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, zero_init_residual=True)[source]¶ ResNet 2d audio backbone. Reference:
- Parameters
depth (int) – Depth of resnet, from {50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
in_channels (int) – Channel num of input features. Default: 1.
base_channels (int) – Channel num of stem output features. Default: 32.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of residual blocks of each stage. Default: (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (int) – Kernel size of the first conv layer. Default: 9.
conv1_stride (int | tuple[int]) – Stride of the first conv layer. Default: 1.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters.
factorize (Sequence[int]) – factorize Dims of each block for audio. Default: (1, 1, 0, 0).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config for activate layers. Default: dict(type=’ReLU’, inplace=True).
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.
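As a rough illustration, the parameters above can be written as a backbone config dict in the same `dict(type=...)` style used elsewhere in this reference. The values restate the documented defaults; `depth=50` is an arbitrary pick from the allowed set, the helper `per_stage_lengths_ok` is hypothetical, and whether a model builder accepts exactly this dict is an assumption.

```python
# Sketch only: restates the documented ResNetAudio defaults as a plain dict.
# Nothing is built here; depth=50 is an arbitrary choice from {50, 101, 152}.
resnet_audio_cfg = dict(
    type='ResNetAudio',
    depth=50,
    pretrained=None,
    in_channels=1,
    num_stages=4,
    base_channels=32,
    strides=(1, 2, 2, 2),
    dilations=(1, 1, 1, 1),
    conv1_kernel=9,
    conv1_stride=1,
    frozen_stages=-1,
    factorize=(1, 1, 0, 0),
    norm_eval=False,
    with_cp=False,
)


def per_stage_lengths_ok(cfg):
    """Hypothetical check: per-stage tuples have one entry per stage."""
    per_stage = ('strides', 'dilations', 'factorize')
    return all(len(cfg[name]) == cfg['num_stages'] for name in per_stage)


print(per_stage_lengths_ok(resnet_audio_cfg))
```

The check reflects the documented "of each stage" wording: changing `num_stages` would require shortening or extending the per-stage tuples to match.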
-
forward(x)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
-
make_res_layer(block, inplanes, planes, blocks, stride=1, dilation=1, factorize=1, norm_cfg=None, with_cp=False)[source]¶ Build residual layer for ResNetAudio.
- Parameters
block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
stride (int) – Stride in the residual layer. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
factorize (int | Sequence[int]) – Determine whether to factorize for each block. Default: 1.
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: None.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
-
class
mmaction.models.ResNetTIN(depth, num_segments=8, is_tin=True, shift_div=4, **kwargs)[source]¶ ResNet backbone for TIN.
- Parameters
depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments. Default: 8.
is_tin (bool) – Whether to apply temporal interlace. Default: True.
shift_div (int) – Number of division parts for shift. Default: 4.
kwargs (dict, optional) – Arguments for ResNet.
-
class
mmaction.models.ResNetTSM(depth, num_segments=8, is_shift=True, non_local=(0, 0, 0, 0), non_local_cfg={}, shift_div=8, shift_place='blockres', temporal_pool=False, **kwargs)[source]¶ ResNet backbone for TSM.
- Parameters
num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to make temporal shift in resnet layers. Default: True.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stage. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
shift_div (int) – Number of div for shift. Default: 8.
shift_place (str) – Places in resnet layers for shift, which is chosen from [‘block’, ‘blockres’]. If set to ‘block’, it will apply temporal shift to all child blocks in each resnet layer. If set to ‘blockres’, it will apply temporal shift to each conv1 layer of all child blocks in each resnet layer. Default: ‘blockres’.
temporal_pool (bool) – Whether to add temporal pooling. Default: False.
**kwargs (keyword arguments, optional) – Arguments for ResNet.
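To make `shift_div` concrete, here is a minimal pure-Python sketch of a temporal shift over one video. It is not the library's implementation: the function name `temporal_shift` is hypothetical, the frames are plain lists instead of tensors, and the shift directions follow the common TSM convention, which is an assumption here.

```python
def temporal_shift(frames, shift_div=8):
    """Shift a fraction of the channels along the time axis.

    frames: list of num_segments frames, each a list of C channel values.
    The first C // shift_div channels are read from the next frame, the
    following C // shift_div channels from the previous frame, and the
    remaining channels stay in place. Missing neighbours are zero-filled.
    """
    num_segments = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // shift_div
    out = [[0] * num_channels for _ in range(num_segments)]
    for t in range(num_segments):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # take the value from the next frame
            elif c < 2 * fold:
                src = t - 1      # take the value from the previous frame
            else:
                src = t          # unshifted channel
            if 0 <= src < num_segments:
                out[t][c] = frames[src][c]
    return out


# 3 segments, 8 channels; value 10 * t + c makes the source frame visible.
frames = [[10 * t + c for c in range(8)] for t in range(3)]
shifted = temporal_shift(frames, shift_div=4)
print(shifted[1])
```

With `shift_div=4` and 8 channels, two channels move in each direction and four stay put; a larger `shift_div` shifts a smaller fraction of the channels.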
-
class
mmaction.models.SSNLoss[source]¶ -
static
activity_loss(activity_score, labels, activity_indexer)[source]¶ Activity Loss.
It will calculate activity loss given activity_score and label.
- Parameters
activity_score (torch.Tensor) – Predicted activity score.
labels (torch.Tensor) – Groundtruth class label.
activity_indexer (torch.Tensor) – Index slices of proposals.
- Returns
Returned cross entropy loss.
- Return type
torch.Tensor
-
static
classwise_regression_loss(bbox_pred, labels, bbox_targets, regression_indexer)[source]¶ Classwise Regression Loss.
It will calculate classwise_regression loss given class_reg_pred and targets.
- Parameters
bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
regression_indexer (torch.Tensor) – Index slices of positive proposals.
- Returns
Returned class-wise regression loss.
- Return type
torch.Tensor
-
static
completeness_loss(completeness_score, labels, completeness_indexer, positive_per_video, incomplete_per_video, ohem_ratio=0.17)[source]¶ Completeness Loss.
It will calculate completeness loss given completeness_score and label.
- Parameters
completeness_score (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
completeness_indexer (torch.Tensor) – Index slices of positive and incomplete proposals.
positive_per_video (int) – Number of positive proposals sampled per video.
incomplete_per_video (int) – Number of incomplete proposals sampled per video.
ohem_ratio (float) – Ratio of online hard example mining. Default: 0.17.
- Returns
Returned class-wise completeness loss.
- Return type
torch.Tensor
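The `ohem_ratio` parameter refers to online hard example mining. The sketch below shows the general idea only: keep the fraction of samples with the largest loss. The helper `ohem_select` is hypothetical, and how `completeness_loss` applies the ratio per video is not described above, so this should not be read as its actual logic.

```python
def ohem_select(losses, ohem_ratio=0.17):
    """Return the indices of the hardest samples (largest loss values).

    Keeps int(len(losses) * ohem_ratio) samples, but always at least one.
    """
    keep = max(1, int(len(losses) * ohem_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:keep])


losses = [0.1, 2.3, 0.7, 1.9, 0.05, 0.4]
hard = ohem_select(losses, ohem_ratio=0.5)   # keeps 3 of the 6 samples
hardest = ohem_select(losses)                # default ratio keeps 1 sample
print(hard, hardest)
```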
-
forward(activity_score, completeness_score, bbox_pred, proposal_type, labels, bbox_targets, train_cfg)[source]¶ Calculate the SSN loss.
- Parameters
activity_score (torch.Tensor) – Predicted activity score.
completeness_score (torch.Tensor) – Predicted completeness score.
bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
proposal_type (torch.Tensor) – Type index slices of proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
train_cfg (dict) – Config for training.
- Returns
(loss_activity, loss_completeness, loss_reg). Loss_activity is the activity loss, loss_completeness is the class-wise completeness loss, loss_reg is the class-wise regression loss.
- Return type
dict([torch.Tensor, torch.Tensor, torch.Tensor])
-
class
mmaction.models.SingleRoIExtractor3D(roi_layer_type='RoIAlign', featmap_stride=16, output_size=16, sampling_ratio=0, pool_mode='avg', aligned=True, with_temporal_pool=True, with_global=False)[source]¶ Extract RoI features from a single level feature map.
- Parameters
roi_layer_type (str) – Specify the RoI layer type. Default: ‘RoIAlign’.
featmap_stride (int) – Strides of input feature maps. Default: 16.
output_size (int | tuple) – Size or (Height, Width). Default: 16.
sampling_ratio (int) – number of inputs samples to take for each output sample. 0 to take samples densely for current models. Default: 0.
pool_mode (str, 'avg' or 'max') – pooling mode in each bin. Default: ‘avg’.
aligned (bool) – if False, use the legacy implementation in MMDetection. If True, align the results more perfectly. Default: True.
with_temporal_pool (bool) – if True, avgpool the temporal dim. Default: True.
with_global (bool) – if True, concatenate the RoI feature with global feature. Default: False.
Note that sampling_ratio, pool_mode, aligned only apply when roi_layer_type is set as RoIAlign.
-
forward(feat, rois)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
mmaction.models.SlowFastHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.8, init_std=0.01, **kwargs)[source]¶ The classification head for SlowFast.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
-
class
mmaction.models.TAM(in_channels, num_segments, alpha=2, adaptive_kernel_size=3, beta=4, conv1d_kernel_size=3, adaptive_convolution_stride=1, adaptive_convolution_padding=1, init_std=0.001)[source]¶ Temporal Adaptive Module(TAM) for TANet.
This module is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION
- Parameters
in_channels (int) – Channel num of input features.
num_segments (int) – Number of frame segments.
alpha (int) – `alpha` in the paper; the ratio of the intermediate channel number to the initial channel number in the global branch. Default: 2.
adaptive_kernel_size (int) – `K` in the paper; the size of the adaptive kernel in the global branch. Default: 3.
beta (int) – `beta` in the paper; controls the model complexity in the local branch. Default: 4.
conv1d_kernel_size (int) – Size of the convolution kernel of Conv1d in the local branch. Default: 3.
adaptive_convolution_stride (int) – The first dimension of strides in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
adaptive_convolution_padding (int) – The first dimension of paddings in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
init_std (float) – Std value for initialization of nn.Linear. Default: 0.001.
-
class
mmaction.models.TANet(depth, num_segments, tam_cfg={}, **kwargs)[source]¶ Temporal Adaptive Network (TANet) backbone.
This backbone is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION
Embedding the temporal adaptive module (TAM) into ResNet to instantiate TANet.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments.
tam_cfg (dict | None) – Config for temporal adaptive module (TAM). Default: dict().
**kwargs (keyword arguments, optional) – Arguments for ResNet except
`depth`.
-
class
mmaction.models.TEM(temporal_dim, boundary_ratio, tem_feat_dim, tem_hidden_dim, tem_match_threshold, loss_cls={'type': 'BinaryLogisticRegressionLoss'}, loss_weight=2, output_dim=3, conv1_ratio=1, conv2_ratio=1, conv3_ratio=0.01)[source]¶ Temporal Evaluation Model for Boundary Sensitive Network.
Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.
Code reference https://github.com/wzmsltw/BSN-boundary-sensitive-network
- Parameters
tem_feat_dim (int) – Feature dimension.
tem_hidden_dim (int) – Hidden layer dimension.
tem_match_threshold (float) – Temporal evaluation match threshold.
loss_cls (dict) – Config for building loss. Default: dict(type='BinaryLogisticRegressionLoss').
loss_weight (float) – Weight term for action_loss. Default: 2.
output_dim (int) – Output dimension. Default: 3.
conv1_ratio (float) – Ratio of conv1 layer output. Default: 1.0.
conv2_ratio (float) – Ratio of conv2 layer output. Default: 1.0.
conv3_ratio (float) – Ratio of conv3 layer output. Default: 0.01.
-
forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶ Define the computation performed at every call.
-
forward_test(raw_feature, video_meta)[source]¶ Define the computation performed at every call when testing.
-
class
mmaction.models.TPN(in_channels, out_channels, spatial_modulation_cfg=None, temporal_modulation_cfg=None, upsample_cfg=None, downsample_cfg=None, level_fusion_cfg=None, aux_head_cfg=None, flow_type='cascade')[source]¶ TPN neck.
This module is proposed in Temporal Pyramid Network for Action Recognition
- Parameters
in_channels (tuple[int]) – Channel numbers of input features tuple.
out_channels (int) – Channel number of output feature.
spatial_modulation_cfg (dict | None) – Config for spatial modulation layers. Required keys are in_channels and out_channels. Default: None.
temporal_modulation_cfg (dict | None) – Config for temporal modulation layers. Default: None.
upsample_cfg (dict | None) – Config for upsample layers. The keys are the same as those in nn.Upsample. Default: None.
downsample_cfg (dict | None) – Config for downsample layers. Default: None.
level_fusion_cfg (dict | None) – Config for level fusion layers. Required keys are ‘in_channels’, ‘mid_channels’, ‘out_channels’. Default: None.
aux_head_cfg (dict | None) – Config for aux head layers. Required keys are ‘out_channels’. Default: None.
flow_type (str) – Flow type to combine the features. Options are ‘cascade’ and ‘parallel’. Default: ‘cascade’.
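The required keys listed above can be checked before a neck config is handed to a builder. The following is a minimal sketch under stated assumptions: `REQUIRED_KEYS` and `missing_required_keys` are hypothetical names, and every channel value in `tpn_cfg` is a placeholder chosen for illustration rather than a recommended setting.

```python
# Required keys as stated in the parameter descriptions above.
REQUIRED_KEYS = {
    'spatial_modulation_cfg': {'in_channels', 'out_channels'},
    'level_fusion_cfg': {'in_channels', 'mid_channels', 'out_channels'},
    'aux_head_cfg': {'out_channels'},
}


def missing_required_keys(neck_cfg):
    """Map each sub-config name to the required keys it lacks."""
    missing = {}
    for name, required in REQUIRED_KEYS.items():
        sub_cfg = neck_cfg.get(name)
        if sub_cfg is None:          # None is the documented default
            continue
        absent = required - set(sub_cfg)
        if absent:
            missing[name] = sorted(absent)
    return missing


# Placeholder values only; none of these numbers come from the reference.
tpn_cfg = dict(
    type='TPN',
    in_channels=(1024, 2048),
    out_channels=1024,
    spatial_modulation_cfg=dict(in_channels=(1024, 2048), out_channels=2048),
    level_fusion_cfg=dict(
        in_channels=(1024, 1024),
        mid_channels=(1024, 1024),
        out_channels=2048),
    aux_head_cfg=dict(out_channels=400),
    flow_type='cascade',
)

print(missing_required_keys(tpn_cfg))
```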
-
forward(x, target=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
mmaction.models.TPNHead(*args, **kwargs)[source]¶ Class head for TPN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for initialization. Default: 0.01.
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: https://arxiv.org/abs/1906.02629. Default: 0.
-
forward(x, num_segs=None, fcn_test=False)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int | None) – Number of segments into which a video is divided. Default: None.
fcn_test (bool) – Whether to apply full convolution (fcn) testing. Default: False.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
-
class
mmaction.models.TRNHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', relation_type='TRNMultiScale', hidden_dim=256, dropout_ratio=0.8, init_std=0.001, **kwargs)[source]¶ Class head for TRN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
relation_type (str) – The relation module type. Choices are ‘TRN’ or ‘TRNMultiScale’. Default: ‘TRNMultiScale’.
hidden_dim (int) – The dimension of hidden layer of MLP in relation module. Default: 256.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for initialization. Default: 0.001.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
-
forward(x, num_segs)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int) – Unused in TRNHead. By default, num_segs is equal to clip_len * num_clips * num_crops, which is generated automatically in the Recognizer forward phase and is not needed by TRN models. The self.num_segments used instead is a hyperparameter for building TRN models.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
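The difference between `num_segs` and `self.num_segments` is easy to miss, so here is a small arithmetic sketch. Both helper names are hypothetical, and regrouping rows by `num_segments` is an assumption about how such a head would use its hyperparameter, not a description of the actual forward pass.

```python
def default_num_segs(clip_len, num_clips, num_crops):
    """Value that, per the description above, num_segs takes by default."""
    return clip_len * num_clips * num_crops


def group_by_segments(rows, num_segments):
    """Regroup a flat list of per-frame rows into chunks of num_segments."""
    if len(rows) % num_segments != 0:
        raise ValueError('row count must be a multiple of num_segments')
    return [rows[i:i + num_segments]
            for i in range(0, len(rows), num_segments)]


num_segs = default_num_segs(clip_len=1, num_clips=8, num_crops=3)
num_segments = 8                      # hyperparameter fixed at build time
groups = group_by_segments(list(range(num_segs)), num_segments)
print(num_segs, len(groups))
```

In this example the automatically generated value is 24, while the grouping still uses the fixed hyperparameter 8, which yields three groups of eight rows.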
-
class
mmaction.models.TSMHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.8, init_std=0.001, is_shift=True, temporal_pool=False, **kwargs)[source]¶ Class head for TSM.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Std value for initialization. Default: 0.001.
is_shift (bool) – Indicating whether the feature is shifted. Default: True.
temporal_pool (bool) – Indicating whether feature is temporal pooled. Default: False.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
-
forward(x, num_segs)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int) – Unused in TSMHead. By default, num_segs is equal to clip_len * num_clips * num_crops, which is generated automatically in the Recognizer forward phase and is not needed by TSM models. The self.num_segments used instead is a hyperparameter for building TSM models.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
-
class
mmaction.models.TSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶ Class head for TSN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Std value for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
-
class
mmaction.models.X3D(gamma_w=1.0, gamma_b=1.0, gamma_d=1.0, pretrained=None, in_channels=3, num_stages=4, spatial_strides=(2, 2, 2, 2), frozen_stages=- 1, se_style='half', se_ratio=0.0625, use_swish=True, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶ X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.
- Parameters
gamma_w (float) – Global channel width expansion factor. Default: 1.
gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.
gamma_d (float) – Network depth expansion factor. Default: 1.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).
frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict) – Config for conv layers. The required key is type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
-
forward(x)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
-
make_res_layer(block, layer_inplanes, inplanes, planes, blocks, spatial_stride=1, se_style='half', se_ratio=None, use_swish=True, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶ Build residual layer for ResNet3D.
- Parameters
block (nn.Module) – Residual module to be built.
layer_inplanes (int) – Number of channels for the input feature of the res layer.
inplanes (int) – Number of channels for the input feature in each block, which equals base_channels * gamma_w.
planes (int) – Number of channels for the output feature in each block, which equals base_channels * gamma_w * gamma_b.
blocks (int) – Number of residual blocks.
spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activate layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
- Return type
nn.Module
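The relations stated for `inplanes` and `planes` can be checked with plain arithmetic. In the sketch below, `x3d_block_channels` is a hypothetical helper, `base_channels=24` and `gamma_b=2.25` are illustrative values, and any rounding the backbone may apply to channel counts is deliberately left out.

```python
def x3d_block_channels(base_channels, gamma_w, gamma_b):
    """Apply the two documented formulas without any rounding."""
    inplanes = base_channels * gamma_w
    planes = base_channels * gamma_w * gamma_b
    return inplanes, planes


# Illustrative values only; a real backbone may round these to integers.
inplanes, planes = x3d_block_channels(
    base_channels=24, gamma_w=1.0, gamma_b=2.25)
print(inplanes, planes)
```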
-
class
mmaction.models.X3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, fc1_bias=False)[source]¶ Classification head for X3D.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization. Default: 0.01.
fc1_bias (bool) – If the first fc layer has bias. Default: False.
recognizers¶
-
class
mmaction.models.recognizers.AudioRecognizer(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶ Audio recognizer model framework.
-
forward(audios, label=None, return_loss=True)[source]¶ Define the computation performed at every call.
-
forward_gradcam(audios)[source]¶ Defines the computation performed at every call when using gradcam utils.
-
forward_test(audios)[source]¶ Defines the computation performed at every call when evaluation and testing.
-
forward_train(audios, labels)[source]¶ Defines the computation performed at every call when training.
-
train_step(data_batch, optimizer, **kwargs)[source]¶ The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
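The shape of the returned dict can be sketched without any framework code. In the example below, `toy_train_step` is a hypothetical stand-in that works on plain floats instead of tensors, and summing every entry whose name contains `loss` is an assumed aggregation rule used only for illustration.

```python
def toy_train_step(losses, batch_size):
    """Build a dict with the three documented keys from plain floats.

    losses: mapping from names such as 'loss_cls' or 'top1_acc' to floats.
    """
    # Assumed rule: entries whose name contains 'loss' are summed.
    loss = sum(value for name, value in losses.items() if 'loss' in name)
    log_vars = dict(losses, loss=loss)   # everything sent to the logger
    return dict(loss=loss, log_vars=log_vars, num_samples=batch_size)


outputs = toy_train_step({'loss_cls': 0.75, 'top1_acc': 0.5}, batch_size=4)
print(sorted(outputs))
```

A logger could then average `log_vars` over iterations using `num_samples` as the weight, which is the purpose the description above gives for that key.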
-
val_step(data_batch, optimizer, **kwargs)[source]¶ The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
-
-
class
mmaction.models.recognizers.BaseRecognizer(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶ Base class for recognizers.
All recognizers should subclass it. All subclasses should override:
forward_train, which supports the forward pass when training.
forward_test, which supports the forward pass when testing.
- Parameters
backbone (dict) – Backbone modules to extract feature.
cls_head (dict) – Classification head to process feature.
train_cfg (dict | None) – Config for training. Default: None.
test_cfg (dict | None) – Config for testing. Default: None.
-
average_clip(cls_score, num_segs=1)[source]¶ Averaging class score over multiple clips.
Uses the averaging type defined in test_cfg (‘score’, ‘prob’ or None) to compute the final averaged class score. Only called in test mode.
- Parameters
cls_score (torch.Tensor) – Class score to be averaged.
num_segs (int) – Number of clips for each input sample.
- Returns
Averaged class score.
- Return type
torch.Tensor
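The three averaging types can be illustrated with nested lists in place of tensors. This is a sketch of the idea rather than the method's code: the keyword name `average_clips` is an assumption about how the type is passed, and treating None as "return the scores unchanged" is likewise assumed.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of class scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def average_clip(cls_score, num_segs=1, average_clips=None):
    """Average per-clip rows; every sample owns num_segs consecutive rows."""
    if average_clips is None:
        return cls_score
    if average_clips not in ('score', 'prob'):
        raise ValueError(f'unsupported averaging type: {average_clips}')
    averaged = []
    for start in range(0, len(cls_score), num_segs):
        clips = cls_score[start:start + num_segs]
        if average_clips == 'prob':
            # Convert scores to probabilities before averaging.
            clips = [softmax(row) for row in clips]
        averaged.append([sum(col) / len(clips) for col in zip(*clips)])
    return averaged


# Two samples with two clips each and two classes.
scores = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [3.0, 1.0]]
by_score = average_clip(scores, num_segs=2, average_clips='score')
by_prob = average_clip(scores, num_segs=2, average_clips='prob')
print(by_score)
```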
-
extract_feat(imgs)[source]¶ Extract features through a backbone.
- Parameters
imgs (torch.Tensor) – The input images.
- Returns
The extracted features.
- Return type
torch.Tensor
-
forward(imgs, label=None, return_loss=True, **kwargs)[source]¶ Define the computation performed at every call.
-
abstract
forward_gradcam(imgs)[source]¶ Defines the computation performed at every call when using gradcam utils.
-
abstract
forward_test(imgs)[source]¶ Defines the computation performed at every call when evaluation and testing.
-
abstract
forward_train(imgs, labels, **kwargs)[source]¶ Defines the computation performed at every call when training.
-
train_step(data_batch, optimizer, **kwargs)[source]¶ The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
-
val_step(data_batch, optimizer, **kwargs)[source]¶ The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
-
property
with_neck¶ Whether the recognizer has a neck.
- Type
bool
-
class
mmaction.models.recognizers.Recognizer2D(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶ 2D recognizer model framework.
-
forward_dummy(imgs, softmax=False)[source]¶ Used for computing network FLOPs.
See tools/analysis/get_flops.py.
- Parameters
imgs (torch.Tensor) – Input images.
- Returns
Class score.
- Return type
Tensor
-
forward_gradcam(imgs)[source]¶ Defines the computation performed at every call when using gradcam utils.
-
-
class
mmaction.models.recognizers.Recognizer3D(backbone, cls_head, neck=None, train_cfg=None, test_cfg=None)[source]¶ 3D recognizer model framework.
-
forward_dummy(imgs, softmax=False)[source]¶ Used for computing network FLOPs.
See tools/analysis/get_flops.py.
- Parameters
imgs (torch.Tensor) – Input images.
- Returns
Class score.
- Return type
Tensor
-
forward_gradcam(imgs)[source]¶ Defines the computation performed at every call when using gradcam utils.
-
localizers¶
-
class
mmaction.models.localizers.BMN(temporal_dim, boundary_ratio, num_samples, num_samples_per_bin, feat_dim, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, loss_cls={'type': 'BMNLoss'}, hidden_dim_1d=256, hidden_dim_2d=128, hidden_dim_3d=512)[source]¶ Boundary Matching Network for temporal action proposal generation.
Please refer to BMN: Boundary-Matching Network for Temporal Action Proposal Generation. Code reference: https://github.com/JJBOY/BMN-Boundary-Matching-Network
- Parameters
temporal_dim (int) – Total frames selected for each video.
boundary_ratio (float) – Ratio for determining video boundaries.
num_samples (int) – Number of samples for each proposal.
num_samples_per_bin (int) – Number of bin samples for each sample.
feat_dim (int) – Feature dimension.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
loss_cls (dict) – Config for building loss. Default: dict(type='BMNLoss').
hidden_dim_1d (int) – Hidden dim for 1d conv. Default: 256.
hidden_dim_2d (int) – Hidden dim for 2d conv. Default: 128.
hidden_dim_3d (int) – Hidden dim for 3d conv. Default: 512.
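The `soft_nms_*` and `post_process_top_k` parameters point to Soft-NMS post-processing of temporal proposals. The sketch below is a generic Gaussian Soft-NMS, not the localizer's code: `temporal_iou` and `soft_nms` are hypothetical helpers, and because the way the low and high thresholds are combined is not described above, a single hypothetical `iou_threshold` stands in for both.

```python
import math


def temporal_iou(a, b):
    """IoU of two (start, end, ...) intervals on the time axis."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(proposals, alpha, iou_threshold, top_k):
    """Keep up to top_k proposals, decaying the scores of overlapping ones.

    proposals: list of (start, end, score) tuples.
    """
    remaining = [list(p) for p in proposals]
    kept = []
    while remaining and len(kept) < top_k:
        best = max(range(len(remaining)), key=lambda i: remaining[i][2])
        chosen = remaining.pop(best)
        kept.append(tuple(chosen))
        for other in remaining:
            iou = temporal_iou(chosen, other)
            if iou > iou_threshold:
                # Gaussian decay: heavier overlap gives a stronger penalty.
                other[2] *= math.exp(-(iou ** 2) / alpha)
    return kept


proposals = [(0.0, 1.0, 0.9), (0.1, 1.0, 0.8), (2.0, 3.0, 0.7)]
kept = soft_nms(proposals, alpha=0.4, iou_threshold=0.5, top_k=3)
print(kept)
```

Unlike hard NMS, the heavily overlapping second proposal is not discarded; its score is decayed so it falls behind the non-overlapping one.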
-
forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶ Define the computation performed at every call.
-
forward_test(raw_feature, video_meta)[source]¶ Define the computation performed at every call when testing.
-
class
mmaction.models.localizers.BaseLocalizer(backbone, cls_head, train_cfg=None, test_cfg=None)[source]¶ Base class for localizers.
All localizers should subclass it. All subclasses should override:
forward_train, which supports the forward pass when training.
forward_test, which supports the forward pass when testing.
-
extract_feat(imgs)[source]¶ Extract features through a backbone.
- Parameters
imgs (torch.Tensor) – The input images.
- Returns
The extracted features.
- Return type
torch.Tensor
-
train_step(data_batch, optimizer, **kwargs)[source]¶ The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN.
- Parameters
data_batch (dict) – The output of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: loss, log_vars, num_samples. loss is a tensor for back propagation, which can be a weighted sum of multiple losses. log_vars contains all the variables to be sent to the logger. num_samples indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs.
- Return type
dict
-
val_step(data_batch, optimizer, **kwargs)[source]¶ The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but with an evaluation hook.
-
-
class
mmaction.models.localizers.PEM(pem_feat_dim, pem_hidden_dim, pem_u_ratio_m, pem_u_ratio_l, pem_high_temporal_iou_threshold, pem_low_temporal_iou_threshold, soft_nms_alpha, soft_nms_low_threshold, soft_nms_high_threshold, post_process_top_k, feature_extraction_interval=16, fc1_ratio=0.1, fc2_ratio=0.1, output_dim=1)[source]¶ Proposals Evaluation Model for Boundary Sensitive Network.
Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.
Code reference https://github.com/wzmsltw/BSN-boundary-sensitive-network
- Parameters
pem_feat_dim (int) – Feature dimension.
pem_hidden_dim (int) – Hidden layer dimension.
pem_u_ratio_m (float) – Ratio for medium score proposals to balance data.
pem_u_ratio_l (float) – Ratio for low score proposals to balance data.
pem_high_temporal_iou_threshold (float) – High IoU threshold.
pem_low_temporal_iou_threshold (float) – Low IoU threshold.
soft_nms_alpha (float) – Soft NMS alpha.
soft_nms_low_threshold (float) – Soft NMS low threshold.
soft_nms_high_threshold (float) – Soft NMS high threshold.
post_process_top_k (int) – Top k proposals in post process.
feature_extraction_interval (int) – Interval used in feature extraction. Default: 16.
fc1_ratio (float) – Ratio for fc1 layer output. Default: 0.1.
fc2_ratio (float) – Ratio for fc2 layer output. Default: 0.1.
output_dim (int) – Output dimension. Default: 1.
-
forward(bsp_feature, reference_temporal_iou=None, tmin=None, tmax=None, tmin_score=None, tmax_score=None, video_meta=None, return_loss=True)[source]¶ Define the computation performed at every call.
-
class
mmaction.models.localizers.SSN(backbone, cls_head, in_channels=3, spatial_type='avg', dropout_ratio=0.5, loss_cls={'type': 'SSNLoss'}, train_cfg=None, test_cfg=None)[source]¶ Temporal Action Detection with Structured Segment Networks.
- Parameters
backbone (dict) – Config for building backbone.
cls_head (dict) – Config for building classification head.
in_channels (int) – Number of channels for input data. Default: 3.
spatial_type (str) – Type of spatial pooling. Default: ‘avg’.
dropout_ratio (float) – Ratio of dropout. Default: 0.5.
loss_cls (dict) – Config for building loss. Default: dict(type='SSNLoss').
train_cfg (dict | None) – Config for training. Default: None.
test_cfg (dict | None) – Config for testing. Default: None.
-
class
mmaction.models.localizers.TEM(temporal_dim, boundary_ratio, tem_feat_dim, tem_hidden_dim, tem_match_threshold, loss_cls={'type': 'BinaryLogisticRegressionLoss'}, loss_weight=2, output_dim=3, conv1_ratio=1, conv2_ratio=1, conv3_ratio=0.01)[source]¶ Temporal Evaluation Model for Boundary Sensitive Network.
Please refer to BSN: Boundary Sensitive Network for Temporal Action Proposal Generation.
Code reference https://github.com/wzmsltw/BSN-boundary-sensitive-network
- Parameters
tem_feat_dim (int) – Feature dimension.
tem_hidden_dim (int) – Hidden layer dimension.
tem_match_threshold (float) – Temporal evaluation match threshold.
loss_cls (dict) – Config for building loss. Default: dict(type='BinaryLogisticRegressionLoss').
loss_weight (float) – Weight term for action_loss. Default: 2.
output_dim (int) – Output dimension. Default: 3.
conv1_ratio (float) – Ratio of conv1 layer output. Default: 1.0.
conv2_ratio (float) – Ratio of conv2 layer output. Default: 1.0.
conv3_ratio (float) – Ratio of conv3 layer output. Default: 0.01.
-
forward(raw_feature, gt_bbox=None, video_meta=None, return_loss=True)[source]¶ Define the computation performed at every call.
-
forward_test(raw_feature, video_meta)[source]¶ Define the computation performed at every call when testing.
common¶
-
class
mmaction.models.common.Conv2plus1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, norm_cfg={'type': 'BN3d'})[source]¶ (2+1)d Conv module for R(2+1)d backbone.
https://arxiv.org/pdf/1711.11248.pdf.
- Parameters
in_channels (int) – Same as nn.Conv3d.
out_channels (int) – Same as nn.Conv3d.
kernel_size (int | tuple[int]) – Same as nn.Conv3d.
stride (int | tuple[int]) – Same as nn.Conv3d.
padding (int | tuple[int]) – Same as nn.Conv3d.
dilation (int | tuple[int]) – Same as nn.Conv3d.
groups (int) – Same as nn.Conv3d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.
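The (2+1)D idea can be shown with a small parameter count. This sketch follows the factorization in the linked paper: a t x d x d convolution is replaced by a 1 x d x d spatial convolution and a t x 1 x 1 temporal convolution, with an intermediate width chosen so the weight count roughly matches the full 3D convolution. The function names are hypothetical, and the sketch makes no claim about how this module computes its own intermediate width.

```python
def mid_channels_2plus1d(in_channels, out_channels, t=3, d=3):
    """Intermediate width from the R(2+1)D paper, rounded down."""
    return (t * d * d * in_channels * out_channels) // (
        d * d * in_channels + t * out_channels)


def conv3d_params(in_channels, out_channels, t=3, d=3):
    """Weight count of a full t x d x d 3D convolution (bias ignored)."""
    return t * d * d * in_channels * out_channels


def conv2plus1d_params(in_channels, out_channels, t=3, d=3):
    """Weight count of a 1 x d x d conv followed by a t x 1 x 1 conv."""
    mid = mid_channels_2plus1d(in_channels, out_channels, t, d)
    spatial = d * d * in_channels * mid
    temporal = t * mid * out_channels
    return spatial + temporal


mid = mid_channels_2plus1d(64, 64)
full = conv3d_params(64, 64)
factored = conv2plus1d_params(64, 64)
```

For 64 input and 64 output channels with a 3 x 3 x 3 kernel, the two weight counts come out equal. The factored form also allows a nonlinearity between the spatial and temporal convolutions.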
-
class
mmaction.models.common.ConvAudio(in_channels, out_channels, kernel_size, op='concat', stride=1, padding=0, dilation=1, groups=1, bias=False)[source]¶ Conv2d module for AudioResNet backbone.
- Parameters
in_channels (int) – Same as nn.Conv2d.
out_channels (int) – Same as nn.Conv2d.
kernel_size (int | tuple[int]) – Same as nn.Conv2d.
op (string) – Operation to merge the output of freq and time feature map. Choices are ‘sum’ and ‘concat’. Default: ‘concat’.
stride (int | tuple[int]) – Same as nn.Conv2d.
padding (int | tuple[int]) – Same as nn.Conv2d.
dilation (int | tuple[int]) – Same as nn.Conv2d.
groups (int) – Same as nn.Conv2d.
bias (bool | str) – If specified as auto, it will be decided by the norm_cfg. Bias will be set as True if norm_cfg is None, otherwise False.
-
class
mmaction.models.common.LFB(lfb_prefix_path, max_num_sampled_feat=5, window_size=60, lfb_channels=2048, dataset_modes=('train', 'val'), device='gpu', lmdb_map_size=4000000000.0, construct_lmdb=True)[source]¶ Long-Term Feature Bank (LFB).
LFB is proposed in Long-Term Feature Banks for Detailed Video Understanding
The ROI features of videos are stored in the feature bank. The feature bank is generated by running inference with an LFB inference config.
Formally, LFB is a Dict whose keys are video IDs and its values are also Dicts whose keys are timestamps in seconds. Example of LFB:
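The nesting described above can be sketched with plain Python containers. Everything here is hypothetical: the video IDs, the timestamps, the short lists standing in for feature tensors, and the helper `features_in_window`. In particular, the window semantics are an assumption for illustration rather than the documented sampling rule.

```python
# Hypothetical bank: video ID -> timestamp in seconds -> list of ROI features.
# Two-element lists stand in for the real feature vectors.
lfb = {
    "video_a": {
        901: [[0.1, 0.2], [0.3, 0.4]],  # two ROIs at second 901
        902: [[0.5, 0.6]],
    },
    "video_b": {
        15: [[0.7, 0.8]],
    },
}


def features_in_window(bank, video_id, center, window_size):
    """Collect ROI features whose timestamp lies in a window around `center`."""
    half = window_size // 2
    picked = []
    for sec in range(center - half, center + half + 1):
        # Seconds without stored features contribute nothing.
        picked.extend(bank[video_id].get(sec, []))
    return picked


window = features_in_window(lfb, "video_a", center=901, window_size=2)
```

Looking up `lfb["video_a"][901]` returns all ROI features stored for that second, and `window` gathers the features from neighbouring seconds of the same video.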
- Parameters
lfb_prefix_path (str) – The storage path of lfb.
max_num_sampled_feat (int) – The max number of sampled features. Default: 5.
window_size (int) – Window size of sampling long term feature. Default: 60.
lfb_channels (int) – Number of the channels of the features stored in LFB. Default: 2048.
dataset_modes (tuple[str] | str) – Load LFB of datasets with different modes, such as training, validation, testing datasets. If you don’t do cross validation during training, just load the training dataset i.e. setting dataset_modes = (‘train’). Default: (‘train’, ‘val’).
device (str) – Where to load lfb. Choices are ‘gpu’, ‘cpu’ and ‘lmdb’. A 1.65GB half-precision ava lfb (including training and validation) occupies about 2GB GPU memory. Default: ‘gpu’.
lmdb_map_size (int) – Map size of lmdb. Default: 4e9.
construct_lmdb (bool) – Whether to construct lmdb. If you have constructed lmdb of lfb, you can set to False to skip the construction. Default: True.
-
class
mmaction.models.common.TAM(in_channels, num_segments, alpha=2, adaptive_kernel_size=3, beta=4, conv1d_kernel_size=3, adaptive_convolution_stride=1, adaptive_convolution_padding=1, init_std=0.001)[source]¶ Temporal Adaptive Module(TAM) for TANet.
This module is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION
- Parameters
in_channels (int) – Channel num of input features.
num_segments (int) – Number of frame segments.
alpha (int) – Corresponds to `alpha` in the paper: the ratio of the intermediate channel number to the initial channel number in the global branch. Default: 2.
adaptive_kernel_size (int) – Corresponds to `K` in the paper: the size of the adaptive kernel in the global branch. Default: 3.
beta (int) – Corresponds to `beta` in the paper: controls the model complexity in the local branch. Default: 4.
conv1d_kernel_size (int) – Size of the convolution kernel of Conv1d in the local branch. Default: 3.
adaptive_convolution_stride (int) – The first dimension of strides in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
adaptive_convolution_padding (int) – The first dimension of paddings in the adaptive convolution of `Temporal Adaptive Aggregation`. Default: 1.
init_std (float) – Std value for initialization of nn.Linear. Default: 0.001.
backbones¶
-
class
mmaction.models.backbones.C3D(pretrained=None, style='pytorch', conv_cfg=None, norm_cfg=None, act_cfg=None, dropout_ratio=0.5, init_std=0.005)[source]¶ C3D backbone.
- Parameters
pretrained (str | None) – Name of pretrained model.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
conv_cfg (dict | None) – Config dict for convolution layer. If set to None, it uses dict(type='Conv3d') to construct layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. The required key is type. Default: None.
act_cfg (dict | None) – Config dict for activation layer. If set to None, it uses dict(type='ReLU') to construct layers. Default: None.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Std value for initialization of fc layers. Default: 0.005.
-
class
mmaction.models.backbones.MobileNetV2(pretrained=None, widen_factor=1.0, out_indices=(7), frozen_stages=- 1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU6'}, norm_eval=False, with_cp=False)[source]¶ MobileNetV2 backbone.
- Parameters
pretrained (str | None) – Name of pretrained model. Default: None.
widen_factor (float) – Width multiplier, multiply number of channels in each layer by this amount. Default: 1.0.
out_indices (None or Sequence[int]) – Output from which stages. Default: (7, ).
frozen_stages (int) – Stages to be frozen (all param fixed). Default: -1, which means not freezing any parameters.
conv_cfg (dict) – Config dict for convolution layer. Default: dict(type=’Conv’).
norm_cfg (dict) – Config dict for normalization layer. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type=’ReLU6’, inplace=True).
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
-
forward(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
make_layer(out_channels, num_blocks, stride, expand_ratio)[source]¶ Stack InvertedResidual blocks to build a layer for MobileNetV2.
- Parameters
out_channels (int) – out_channels of block.
num_blocks (int) – number of blocks.
stride (int) – stride of the first block. Default: 1
expand_ratio (int) – Expand the number of channels of the hidden layer in InvertedResidual by this ratio. Default: 6.
-
train(mode=True)[source]¶ Sets the module in training mode.
This has an effect only on certain modules. See the documentation of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.
- Parameters
mode (bool) – whether to set training mode (True) or evaluation mode (False). Default: True.
- Returns
self
- Return type
Module
-
class
mmaction.models.backbones.MobileNetV2TSM(num_segments=8, is_shift=True, shift_div=8, **kwargs)[source]¶ MobileNetV2 backbone for TSM.
- Parameters
num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to make temporal shift in reset layers. Default: True.
shift_div (int) – Number of div for shift. Default: 8.
**kwargs (keyword arguments, optional) – Arguments for MobileNetV2.
-
class
mmaction.models.backbones.ResNet(depth, pretrained=None, torchvision_pretrain=True, in_channels=3, num_stages=4, out_indices=(3), strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), style='pytorch', frozen_stages=- 1, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, partial_bn=False, with_cp=False)[source]¶ ResNet backbone.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of the first block of each stage.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
dilations (Sequence[int]) – Dilation of each stage.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config for activation layers. Default: dict(type=’ReLU’, inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
partial_bn (bool) – Whether to use partial bn. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
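The per-stage parameters above have to agree with one another: strides and dilations need one entry per stage, and out_indices must point at existing stages. A minimal sketch of such a consistency check on a config dict is shown below. The helper `check_resnet_cfg` is hypothetical and encodes what seems sensible for these parameters. It is not a statement of what the constructor itself validates.

```python
def check_resnet_cfg(cfg):
    """Return True if the stage-related entries of `cfg` agree with each other.

    Missing keys fall back to the defaults listed in the signature above.
    """
    num_stages = cfg.get("num_stages", 4)
    strides = cfg.get("strides", (1, 2, 2, 2))
    dilations = cfg.get("dilations", (1, 1, 1, 1))
    out_indices = cfg.get("out_indices", (3, ))
    return (1 <= num_stages <= 4
            and len(strides) == len(dilations) == num_stages
            and max(out_indices) < num_stages)


# Hypothetical backbone config in the dict style used throughout this page.
backbone = dict(type="ResNet", depth=50, pretrained=None,
                out_indices=(3, ), norm_eval=False)

# Shrinking num_stages without shortening strides/dilations is inconsistent.
bad = dict(backbone, num_stages=3)
```

Running `check_resnet_cfg(backbone)` passes, while `check_resnet_cfg(bad)` fails because the default four-entry strides no longer match three stages.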
-
class
mmaction.models.backbones.ResNet2Plus1d(*args, **kwargs)[source]¶ ResNet (2+1)d backbone.
This model is proposed in A Closer Look at Spatiotemporal Convolutions for Action Recognition
-
class
mmaction.models.backbones.ResNet3d(depth, pretrained, pretrained2d=True, in_channels=3, num_stages=4, base_channels=64, out_indices=(3), spatial_strides=(1, 2, 2, 2), temporal_strides=(1, 1, 1, 1), dilations=(1, 1, 1, 1), conv1_kernel=(5, 7, 7), conv1_stride_t=2, pool1_stride_t=2, with_pool2=True, style='pytorch', frozen_stages=- 1, inflate=(1, 1, 1, 1), inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, non_local=(0, 0, 0, 0), non_local_cfg={}, zero_init_residual=True, **kwargs)[source]¶ ResNet 3d backbone.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
in_channels (int) – Channel num of input features. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
out_indices (Sequence[int]) – Indices of output feature. Default: (3, ).
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (1, 2, 2, 2).
temporal_strides (Sequence[int]) – Temporal strides of residual blocks of each stage. Default: (1, 1, 1, 1).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (5, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 2.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 2.
with_pool2 (bool) – Whether to use pool2. Default: True.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
inflate (Sequence[int]) – Inflate Dims of each block. Default: (1, 1, 1, 1).
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
zero_init_residual (bool) – Whether to use zero initialization for residual block. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
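The pretrained2d and inflate options refer to reusing 2D weights in a 3D network. One common way to do this, sketched here as an assumption rather than as this class's confirmed behaviour, is to repeat each 2D kernel along the time axis and divide by the temporal kernel size. A 3D filter built this way responds to a static clip the same way the 2D filter responds to a single frame. The helper name `inflate_kernel` is hypothetical.

```python
def inflate_kernel(kernel_2d, kernel_t):
    """Repeat a 2D kernel `kernel_t` times along time, scaled by 1 / kernel_t."""
    return [[[w / kernel_t for w in row] for row in kernel_2d]
            for _ in range(kernel_t)]


kernel_2d = [[1.0, 2.0], [3.0, 4.0]]
kernel_3d = inflate_kernel(kernel_2d, kernel_t=4)

# Summing the inflated kernel over time gives back the 2D kernel.
collapsed = [[sum(kernel_3d[t][i][j] for t in range(4)) for j in range(2)]
             for i in range(2)]
```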
-
forward(x)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
-
static
make_res_layer(block, inplanes, planes, blocks, spatial_stride=1, temporal_stride=1, dilation=1, style='pytorch', inflate=1, inflate_style='3x1x1', non_local=0, non_local_cfg={}, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶ Build residual layer for ResNet3D.
- Parameters
block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
spatial_stride (int | Sequence[int]) – Spatial strides in residual and conv layers. Default: 1.
temporal_stride (int | Sequence[int]) – Temporal strides in residual and conv layers. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
style (str) – pytorch or caffe. If set to pytorch, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: pytorch.
inflate (int | Sequence[int]) – Determine whether to inflate for each block. Default: 1.
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
non_local (int | Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: 0.
non_local_cfg (dict) – Config for non-local module. Default: dict().
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activate layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
- Return type
nn.Module
-
class
mmaction.models.backbones.ResNet3dCSN(depth, pretrained, temporal_strides=(1, 2, 2, 2), conv1_kernel=(3, 7, 7), conv1_stride_t=1, pool1_stride_t=1, norm_cfg={'eps': 0.001, 'requires_grad': True, 'type': 'BN3d'}, inflate_style='3x3x3', bottleneck_mode='ir', bn_frozen=False, **kwargs)[source]¶ ResNet backbone for CSN.
- Parameters
depth (int) – Depth of ResNetCSN, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
temporal_strides (tuple[int]) – Temporal strides of residual blocks of each stage. Default: (1, 2, 2, 2).
conv1_kernel (tuple[int]) – Kernel size of the first conv layer. Default: (3, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: dict(type=’BN3d’, requires_grad=True, eps=1e-3).
inflate_style (str) – 3x1x1 or 1x1x1. which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x3x3’.
bottleneck_mode (str) –
Determine which ways to factorize a 3D bottleneck block using channel-separated convolutional networks.
If set to ‘ip’, it will replace the 3x3x3 conv2 layer with a 1x1x1 traditional convolution and a 3x3x3 depthwise convolution, i.e., Interaction-preserved channel-separated bottleneck block. If set to ‘ir’, it will replace the 3x3x3 conv2 layer with a 3x3x3 depthwise convolution, which is derived from the interaction-preserved bottleneck block by removing the extra 1x1x1 convolution, i.e., Interaction-reduced channel-separated bottleneck block.
Default: ‘ir’.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
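The difference between the two bottleneck modes is easiest to see in weight counts. The sketch below compares a standard 3x3x3 conv2 with the ‘ip’ and ‘ir’ replacements described above, for a layer with the same number of input and output channels. The function name `conv2_weights` and the choice of 64 channels are illustrative assumptions, and biases and norm layers are ignored.

```python
def conv2_weights(channels, mode, k=3):
    """Weight count of the conv2 stage with `channels` inputs and outputs."""
    depthwise = k * k * k * channels   # one k x k x k filter per channel
    pointwise = channels * channels    # 1x1x1 conv that mixes channels
    if mode == "standard":
        return k * k * k * channels * channels
    if mode == "ip":                   # 1x1x1 conv + depthwise conv
        return pointwise + depthwise
    if mode == "ir":                   # depthwise conv only
        return depthwise
    raise ValueError(f"unknown mode: {mode}")


standard = conv2_weights(64, "standard")
ip = conv2_weights(64, "ip")
ir = conv2_weights(64, "ir")
```

Under these assumptions ‘ir’ is the cheapest option, and ‘ip’ spends extra weights on the 1x1x1 convolution to keep channel interactions inside conv2.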
-
class
mmaction.models.backbones.ResNet3dLayer(depth, pretrained, pretrained2d=True, stage=3, base_channels=64, spatial_stride=2, temporal_stride=1, dilation=1, style='pytorch', all_frozen=False, inflate=1, inflate_style='3x1x1', conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶ ResNet 3d Layer.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
pretrained2d (bool) – Whether to load pretrained 2D model. Default: True.
stage (int) – The index of Resnet stage. Default: 3.
base_channels (int) – Channel num of stem output features. Default: 64.
spatial_stride (int) – The 1st res block’s spatial stride. Default 2.
temporal_stride (int) – The 1st res block’s temporal stride. Default 1.
dilation (int) – The dilation. Default: 1.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two layer is the 3x3 conv layer, otherwise the stride-two layer is the first 1x1 conv layer. Default: ‘pytorch’.
all_frozen (bool) – Freeze all modules in the layer. Default: False.
inflate (int) – Inflate Dims of each block. Default: 1.
inflate_style (str) – 3x1x1 or 1x1x1, which determines the kernel sizes and padding strides for conv1 and conv2 in each block. Default: ‘3x1x1’.
conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
-
class
mmaction.models.backbones.ResNet3dSlowFast(pretrained, resample_rate=8, speed_ratio=8, channel_ratio=8, slow_pathway={'conv1_kernel': (1, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'dilations': (1, 1, 1, 1), 'inflate': (0, 0, 1, 1), 'lateral': True, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'}, fast_pathway={'base_channels': 8, 'conv1_kernel': (5, 7, 7), 'conv1_stride_t': 1, 'depth': 50, 'lateral': False, 'pool1_stride_t': 1, 'pretrained': None, 'type': 'resnet3d'})[source]¶ Slowfast backbone.
This module is proposed in SlowFast Networks for Video Recognition
- Parameters
pretrained (str) – The file path to a pretrained model.
resample_rate (int) – A large temporal stride resample_rate on input frames. The actual resample rate is calculated by multiplying the interval in SampleFrames in the pipeline with resample_rate, equivalent to the \(\tau\) in the paper, i.e. it processes only one out of resample_rate * interval frames. Default: 8.
speed_ratio (int) – Speed ratio indicating the ratio between time dimension of the fast and slow pathway, corresponding to the \(\alpha\) in the paper. Default: 8.
channel_ratio (int) – Reduce the channel number of fast pathway by channel_ratio, corresponding to \(\beta\) in the paper. Default: 8.
slow_pathway (dict) –
Configuration of slow branch, should contain necessary arguments for building the specific type of pathway and: type (str): type of backbone the pathway bases on. lateral (bool): determine whether to build lateral connection for the pathway. Default:
dict(type='resnet3d', lateral=True, depth=50, pretrained=None, conv1_kernel=(1, 7, 7), dilations=(1, 1, 1, 1), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1))
fast_pathway (dict) –
Configuration of fast branch, similar to slow_pathway. Default:
dict(type='resnet3d', lateral=False, depth=50, pretrained=None, base_channels=8, conv1_kernel=(5, 7, 7), conv1_stride_t=1, pool1_stride_t=1)
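How resample_rate and speed_ratio relate can be sketched on a list of frame indices. This sketch assumes both pathways subsample the same input clip by plain temporal striding. The backbone may resample differently in practice (for example by interpolation), and the helper name `split_pathways` is hypothetical.

```python
def split_pathways(frames, resample_rate=8, speed_ratio=8):
    """Subsample one clip into slow and fast pathway inputs by striding."""
    slow = frames[::resample_rate]
    # The fast pathway keeps speed_ratio times more frames than the slow one.
    fast = frames[::resample_rate // speed_ratio]
    return slow, fast


frames = list(range(32))          # indices of a 32-frame clip
slow, fast = split_pathways(frames)

# With channel_ratio=8, a 64-channel slow stem pairs with an 8-channel
# fast stem, matching base_channels=8 in the fast_pathway default above.
fast_base_channels = 64 // 8
```

With the default values the slow pathway sees 4 of the 32 frames and the fast pathway sees all of them, which gives the temporal ratio of 8 described for speed_ratio.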
-
class
mmaction.models.backbones.ResNet3dSlowOnly(*args, lateral=False, conv1_kernel=(1, 7, 7), conv1_stride_t=1, pool1_stride_t=1, inflate=(0, 0, 1, 1), with_pool2=False, **kwargs)[source]¶ SlowOnly backbone based on ResNet3dPathway.
- Parameters
*args (arguments) – Arguments same as ResNet3dPathway.
conv1_kernel (Sequence[int]) – Kernel size of the first conv layer. Default: (1, 7, 7).
conv1_stride_t (int) – Temporal stride of the first conv layer. Default: 1.
pool1_stride_t (int) – Temporal stride of the first pooling layer. Default: 1.
inflate (Sequence[int]) – Inflate Dims of each block. Default: (0, 0, 1, 1).
**kwargs (keyword arguments) – Keyword arguments for ResNet3dPathway.
-
class
mmaction.models.backbones.ResNetAudio(depth, pretrained, in_channels=1, num_stages=4, base_channels=32, strides=(1, 2, 2, 2), dilations=(1, 1, 1, 1), conv1_kernel=9, conv1_stride=1, frozen_stages=- 1, factorize=(1, 1, 0, 0), norm_eval=False, with_cp=False, conv_cfg={'type': 'Conv'}, norm_cfg={'requires_grad': True, 'type': 'BN2d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, zero_init_residual=True)[source]¶ ResNet 2d audio backbone. Reference:
- Parameters
depth (int) – Depth of resnet, from {50, 101, 152}.
pretrained (str | None) – Name of pretrained model.
in_channels (int) – Channel num of input features. Default: 1.
base_channels (int) – Channel num of stem output features. Default: 32.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of residual blocks of each stage. Default: (1, 2, 2, 2).
dilations (Sequence[int]) – Dilation of each stage. Default: (1, 1, 1, 1).
conv1_kernel (int) – Kernel size of the first conv layer. Default: 9.
conv1_stride (int | tuple[int]) – Stride of the first conv layer. Default: 1.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means not freezing any parameters. Default: -1.
factorize (Sequence[int]) – factorize Dims of each block for audio. Default: (1, 1, 0, 0).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
conv_cfg (dict) – Config for conv layers. Default: dict(type=’Conv’).
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type=’BN2d’, requires_grad=True).
act_cfg (dict) – Config for activation layers. Default: dict(type=’ReLU’, inplace=True).
zero_init_residual (bool) – Whether to use zero initialization for residual block, Default: True.
-
forward(x)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
-
make_res_layer(block, inplanes, planes, blocks, stride=1, dilation=1, factorize=1, norm_cfg=None, with_cp=False)[source]¶ Build residual layer for ResNetAudio.
- Parameters
block (nn.Module) – Residual module to be built.
inplanes (int) – Number of channels for the input feature in each block.
planes (int) – Number of channels for the output feature in each block.
blocks (int) – Number of residual blocks.
stride (int) – Stride of the residual layer. Default: 1.
dilation (int) – Spacing between kernel elements. Default: 1.
factorize (int | Sequence[int]) – Determine whether to factorize for each block. Default: 1.
norm_cfg (dict) – Config for norm layers. required keys are type and requires_grad. Default: None.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
-
class
mmaction.models.backbones.ResNetTIN(depth, num_segments=8, is_tin=True, shift_div=4, **kwargs)[source]¶ ResNet backbone for TIN.
- Parameters
depth (int) – Depth of ResNet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments. Default: 8.
is_tin (bool) – Whether to apply temporal interlace. Default: True.
shift_div (int) – Number of division parts for shift. Default: 4.
kwargs (dict, optional) – Arguments for ResNet.
-
class
mmaction.models.backbones.ResNetTSM(depth, num_segments=8, is_shift=True, non_local=(0, 0, 0, 0), non_local_cfg={}, shift_div=8, shift_place='blockres', temporal_pool=False, **kwargs)[source]¶ ResNet backbone for TSM.
- Parameters
num_segments (int) – Number of frame segments. Default: 8.
is_shift (bool) – Whether to make temporal shift in resnet layers. Default: True.
non_local (Sequence[int]) – Determine whether to apply non-local module in the corresponding block of each stages. Default: (0, 0, 0, 0).
non_local_cfg (dict) – Config for non-local module. Default: dict().
shift_div (int) – Number of div for shift. Default: 8.
shift_place (str) – Places in resnet layers for shift, which is chosen from [‘block’, ‘blockres’]. If set to ‘block’, it will apply temporal shift to all child blocks in each resnet layer. If set to ‘blockres’, it will apply temporal shift to each conv1 layer of all child blocks in each resnet layer. Default: ‘blockres’.
temporal_pool (bool) – Whether to add temporal pooling. Default: False.
**kwargs (keyword arguments, optional) – Arguments for ResNet.
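The role of shift_div can be illustrated with a minimal temporal shift on nested lists. This follows the shift described in the TSM paper: one 1/shift_div share of the channels takes its value from the next frame, another share from the previous frame, and the rest stay in place, with zeros at the clip boundaries. The function below is an illustrative sketch on plain Python lists, not the module's actual implementation.

```python
def temporal_shift(clip, shift_div=8):
    """Shift channel groups along time; `clip[t][c]` is frame t, channel c."""
    num_frames = len(clip)
    num_channels = len(clip[0])
    fold = num_channels // shift_div
    out = [[0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:            # first group: read from the next frame
                src = t + 1
            elif c < 2 * fold:      # second group: read from the previous frame
                src = t - 1
            else:                   # remaining channels are left untouched
                src = t
            if 0 <= src < num_frames:
                out[t][c] = clip[src][c]
    return out


# 3 frames, 8 channels; the value 10 * t + c encodes (frame, channel).
clip = [[10 * t + c for c in range(8)] for t in range(3)]
shifted = temporal_shift(clip, shift_div=4)
```

A larger shift_div moves fewer channels, so more of each frame's own features are kept unchanged.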
-
class
mmaction.models.backbones.TANet(depth, num_segments, tam_cfg={}, **kwargs)[source]¶ Temporal Adaptive Network (TANet) backbone.
This backbone is proposed in TAM: TEMPORAL ADAPTIVE MODULE FOR VIDEO RECOGNITION
Embedding the temporal adaptive module (TAM) into ResNet to instantiate TANet.
- Parameters
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
num_segments (int) – Number of frame segments.
tam_cfg (dict | None) – Config for temporal adaptive module (TAM). Default: dict().
**kwargs (keyword arguments, optional) – Arguments for ResNet except `depth`.
-
class
mmaction.models.backbones.X3D(gamma_w=1.0, gamma_b=1.0, gamma_d=1.0, pretrained=None, in_channels=3, num_stages=4, spatial_strides=(2, 2, 2, 2), frozen_stages=- 1, se_style='half', se_ratio=0.0625, use_swish=True, conv_cfg={'type': 'Conv3d'}, norm_cfg={'requires_grad': True, 'type': 'BN3d'}, act_cfg={'inplace': True, 'type': 'ReLU'}, norm_eval=False, with_cp=False, zero_init_residual=True, **kwargs)[source]¶ X3D backbone. https://arxiv.org/pdf/2004.04730.pdf.
- Parameters
gamma_w (float) – Global channel width expansion factor. Default: 1.
gamma_b (float) – Bottleneck channel width expansion factor. Default: 1.
gamma_d (float) – Network depth expansion factor. Default: 1.
pretrained (str | None) – Name of pretrained model. Default: None.
in_channels (int) – Channel num of input features. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
spatial_strides (Sequence[int]) – Spatial strides of residual blocks of each stage. Default: (2, 2, 2, 2).
frozen_stages (int) – Stages to be frozen (all param fixed). If set to -1, it means not freezing any parameters. Default: -1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: 1 / 16.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict) – Config for conv layers. Required keys are type. Default: dict(type='Conv3d').
norm_cfg (dict) – Config for norm layers. Required keys are type and requires_grad. Default: dict(type='BN3d', requires_grad=True).
act_cfg (dict) – Config dict for activation layer. Default: dict(type='ReLU', inplace=True).
norm_eval (bool) – Whether to set BN layers to eval mode, namely, freeze running stats (mean and var). Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero initialization for residual blocks. Default: True.
kwargs (dict, optional) – Key arguments for “make_res_layer”.
-
forward(x)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
- Returns
The feature of the input samples extracted by the backbone.
- Return type
torch.Tensor
-
make_res_layer(block, layer_inplanes, inplanes, planes, blocks, spatial_stride=1, se_style='half', se_ratio=None, use_swish=True, norm_cfg=None, act_cfg=None, conv_cfg=None, with_cp=False, **kwargs)[source]¶ Build residual layer for ResNet3D.
- Parameters
block (nn.Module) – Residual module to be built.
layer_inplanes (int) – Number of channels for the input feature of the res layer.
inplanes (int) – Number of channels for the input feature in each block, which equals base_channels * gamma_w.
planes (int) – Number of channels for the output feature in each block, which equals base_channels * gamma_w * gamma_b.
blocks (int) – Number of residual blocks.
spatial_stride (int) – Spatial strides in residual and conv layers. Default: 1.
se_style (str) – The style of inserting SE modules into BlockX3D, ‘half’ denotes insert into half of the blocks, while ‘all’ denotes insert into all blocks. Default: ‘half’.
se_ratio (float | None) – The reduction ratio of squeeze and excitation unit. If set as None, it means not using SE unit. Default: None.
use_swish (bool) – Whether to use swish as the activation function before and after the 3x3x3 conv. Default: True.
conv_cfg (dict | None) – Config for conv layers. Default: None.
norm_cfg (dict | None) – Config for norm layers. Default: None.
act_cfg (dict | None) – Config for activation layers. Default: None.
with_cp (bool | None) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
- Returns
A residual layer for the given config.
- Return type
nn.Module
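The channel relations stated for inplanes and planes above can be sketched in plain Python. The helper name `x3d_block_channels`, the `base_channels=24` value, and the plain `int()` rounding are assumptions made for illustration only; the library may round channel widths differently.

```python
def x3d_block_channels(base_channels, gamma_w, gamma_b):
    """Return (inplanes, planes) following the documented relations.

    Hypothetical helper: truncating with int() is an assumption of this
    sketch, not necessarily how the backbone rounds widths.
    """
    inplanes = int(base_channels * gamma_w)
    planes = int(base_channels * gamma_w * gamma_b)
    return inplanes, planes


# With the default expansion factors, both widths equal base_channels.
print(x3d_block_channels(24, gamma_w=1.0, gamma_b=1.0))
# A larger bottleneck factor scales only `planes`.
print(x3d_block_channels(24, gamma_w=1.0, gamma_b=2.25))
```

Only gamma_b separates the two widths, so setting gamma_b=1 makes the block input and output widths identical.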
heads¶
-
class
mmaction.models.heads.AudioTSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶ Classification head for TSN on audio.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Standard deviation used for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
-
class
mmaction.models.heads.BBoxHeadAVA(temporal_pool_type='avg', spatial_pool_type='max', in_channels=2048, num_classes=81, dropout_ratio=0, dropout_before_pool=True, topk=(3, 5), multilabel=True)[source]¶ Simplest RoI head, with only two fc layers for classification and regression respectively.
- Parameters
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
in_channels (int) – The number of input channels. Default: 2048.
num_classes (int) – The number of classes. Default: 81.
dropout_ratio (float) – A float in [0, 1], indicates the dropout_ratio. Default: 0.
dropout_before_pool (bool) – Dropout Feature before spatial temporal pooling. Default: True.
topk (int or tuple[int]) – Parameter for evaluating multilabel accuracy. Default: (3, 5)
multilabel (bool) – Whether used for a multilabel task. Default: True. (Only support multilabel == True now).
-
forward(x)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
mmaction.models.heads.BaseHead(num_classes, in_channels, loss_cls={'loss_weight': 1.0, 'type': 'CrossEntropyLoss'}, multi_class=False, label_smooth_eps=0.0)[source]¶ Base class for head.
All heads should subclass it. All subclasses should override:
- Method init_weights, which initializes weights in some modules.
- Method forward, which supports forwarding for both training and testing.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’, loss_weight=1.0).
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: arxiv.org/abs/1906.02629. Default: 0.
-
abstract
init_weights()[source]¶ Initiate the parameters either from existing checkpoint or from scratch.
-
loss(cls_score, labels, **kwargs)[source]¶ Calculate the loss given output cls_score and target labels.
- Parameters
cls_score (torch.Tensor) – The output of the model.
labels (torch.Tensor) – The target output of the model.
- Returns
A dict containing field ‘loss_cls’(mandatory) and ‘top1_acc’, ‘top5_acc’(optional).
- Return type
dict
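The label_smooth_eps parameter of BaseHead refers to label smoothing. A minimal sketch of the common formulation follows, written in plain Python; whether the head applies exactly this formula is an assumption, and `smooth_labels` is a hypothetical helper name.

```python
def smooth_labels(one_hot, eps):
    """Blend a one-hot target with a uniform distribution over classes."""
    num_classes = len(one_hot)
    return [(1.0 - eps) * y + eps / num_classes for y in one_hot]


# Class 1 is the ground truth among 4 classes.
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], eps=0.1)
print(smoothed)
```

The smoothed target still sums to 1, and eps=0 leaves the one-hot target unchanged.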
-
class
mmaction.models.heads.FBOHead(lfb_cfg, fbo_cfg, temporal_pool_type='avg', spatial_pool_type='max')[source]¶ Feature Bank Operator Head.
Add feature bank operator for the spatiotemporal detection model to fuse short-term features and long-term features.
- Parameters
lfb_cfg (Dict) – The config dict for LFB which is used to sample long-term features.
fbo_cfg (Dict) – The config dict for feature bank operator (FBO). The type of fbo is also in the config dict and supported fbo type is fbo_dict.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
-
forward(x, rois, img_metas)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
mmaction.models.heads.I3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, **kwargs)[source]¶ Classification head for I3D.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Standard deviation used for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
-
class
mmaction.models.heads.LFBInferHead(lfb_prefix_path, dataset_mode='train', use_half_precision=True, temporal_pool_type='avg', spatial_pool_type='max')[source]¶ Long-Term Feature Bank Infer Head.
This head is used to derive and save the LFB without affecting the input.
- Parameters
lfb_prefix_path (str) – The prefix path to store the lfb.
dataset_mode (str, optional) – Which dataset to be inferred. Choices are ‘train’, ‘val’ or ‘test’. Default: ‘train’.
use_half_precision (bool, optional) – Whether to store the half-precision roi features. Default: True.
temporal_pool_type (str) – The temporal pool type. Choices are ‘avg’ or ‘max’. Default: ‘avg’.
spatial_pool_type (str) – The spatial pool type. Choices are ‘avg’ or ‘max’. Default: ‘max’.
-
forward(x, rois, img_metas)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
class
mmaction.models.heads.SSNHead(dropout_ratio=0.8, in_channels=1024, num_classes=20, consensus={'num_seg': (2, 5, 2), 'standalong_classifier': True, 'stpp_cfg': (1, 1, 1), 'type': 'STPPTrain'}, use_regression=True, init_std=0.001)[source]¶ The classification head for SSN.
- Parameters
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
in_channels (int) – Number of channels for input data. Default: 1024.
num_classes (int) – Number of classes to be classified. Default: 20.
consensus (dict) – Config of segmental consensus.
use_regression (bool) – Whether to perform regression or not. Default: True.
init_std (float) – Standard deviation used for initialization. Default: 0.001.
-
class
mmaction.models.heads.SlowFastHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.8, init_std=0.01, **kwargs)[source]¶ The classification head for SlowFast.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Standard deviation used for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
-
class
mmaction.models.heads.TPNHead(*args, **kwargs)[source]¶ Class head for TPN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Standard deviation used for initialization. Default: 0.01.
multi_class (bool) – Determines whether it is a multi-class recognition task. Default: False.
label_smooth_eps (float) – Epsilon used in label smooth. Reference: https://arxiv.org/abs/1906.02629. Default: 0.
-
forward(x, num_segs=None, fcn_test=False)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int | None) – Number of segments into which a video is divided. Default: None.
fcn_test (bool) – Whether to apply full convolution (fcn) testing. Default: False.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
-
class
mmaction.models.heads.TRNHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', relation_type='TRNMultiScale', hidden_dim=256, dropout_ratio=0.8, init_std=0.001, **kwargs)[source]¶ Class head for TRN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
relation_type (str) – The relation module type. Choices are ‘TRN’ or ‘TRNMultiScale’. Default: ‘TRNMultiScale’.
hidden_dim (int) – The dimension of hidden layer of MLP in relation module. Default: 256.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Standard deviation used for initialization. Default: 0.001.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
-
forward(x, num_segs)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int) – Unused in TRNHead. By default, num_segs equals clip_len * num_clips * num_crops, which is generated automatically in the Recognizer forward phase and is not needed by TRN models. The self.num_segments used instead is a hyperparameter for building TRN models.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
-
class
mmaction.models.heads.TSMHead(num_classes, in_channels, num_segments=8, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.8, init_std=0.001, is_shift=True, temporal_pool=False, **kwargs)[source]¶ Class head for TSM.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
num_segments (int) – Number of frame segments. Default: 8.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.8.
init_std (float) – Standard deviation used for initialization. Default: 0.001.
is_shift (bool) – Indicating whether the feature is shifted. Default: True.
temporal_pool (bool) – Indicating whether feature is temporal pooled. Default: False.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
-
forward(x, num_segs)[source]¶ Defines the computation performed at every call.
- Parameters
x (torch.Tensor) – The input data.
num_segs (int) – Unused in TSMHead. By default, num_segs equals clip_len * num_clips * num_crops, which is generated automatically in the Recognizer forward phase and is not needed by TSM models. The self.num_segments used instead is a hyperparameter for building TSM models.
- Returns
The classification scores for input samples.
- Return type
torch.Tensor
-
class
mmaction.models.heads.TSNHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', consensus={'dim': 1, 'type': 'AvgConsensus'}, dropout_ratio=0.4, init_std=0.01, **kwargs)[source]¶ Class head for TSN.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’).
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
consensus (dict) – Consensus config dict.
dropout_ratio (float) – Probability of dropout layer. Default: 0.4.
init_std (float) – Standard deviation used for initialization. Default: 0.01.
kwargs (dict, optional) – Any keyword argument to be used to initialize the head.
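The default consensus config, {'dim': 1, 'type': 'AvgConsensus'}, suggests averaging per-segment scores along dimension 1. The sketch below illustrates that idea with nested lists, assuming scores are laid out as [batch][segment][class]; that layout and the helper name `avg_consensus` are assumptions, not taken from the library.

```python
def avg_consensus(scores):
    """Average per-segment class scores over the segment axis (dim=1).

    `scores` is nested as [batch][segment][class]; the result holds one
    class-score vector per video.
    """
    out = []
    for video in scores:
        num_segs = len(video)
        num_classes = len(video[0])
        out.append([sum(seg[c] for seg in video) / num_segs
                    for c in range(num_classes)])
    return out


# One video, three segments, two classes.
scores = [[[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]]
print(avg_consensus(scores))
```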
-
class
mmaction.models.heads.X3DHead(num_classes, in_channels, loss_cls={'type': 'CrossEntropyLoss'}, spatial_type='avg', dropout_ratio=0.5, init_std=0.01, fc1_bias=False)[source]¶ Classification head for X3D.
- Parameters
num_classes (int) – Number of classes to be classified.
in_channels (int) – Number of channels in input feature.
loss_cls (dict) – Config for building loss. Default: dict(type=’CrossEntropyLoss’)
spatial_type (str) – Pooling type in spatial dimension. Default: ‘avg’.
dropout_ratio (float) – Probability of dropout layer. Default: 0.5.
init_std (float) – Standard deviation used for initialization. Default: 0.01.
fc1_bias (bool) – If the first fc layer has bias. Default: False.
necks¶
-
class
mmaction.models.necks.TPN(in_channels, out_channels, spatial_modulation_cfg=None, temporal_modulation_cfg=None, upsample_cfg=None, downsample_cfg=None, level_fusion_cfg=None, aux_head_cfg=None, flow_type='cascade')[source]¶ TPN neck.
This module is proposed in Temporal Pyramid Network for Action Recognition
- Parameters
in_channels (tuple[int]) – Channel numbers of input features tuple.
out_channels (int) – Channel number of output feature.
spatial_modulation_cfg (dict | None) – Config for spatial modulation layers. Required keys are in_channels and out_channels. Default: None.
temporal_modulation_cfg (dict | None) – Config for temporal modulation layers. Default: None.
upsample_cfg (dict | None) – Config for upsample layers. The keys are the same as those in nn.Upsample. Default: None.
downsample_cfg (dict | None) – Config for downsample layers. Default: None.
level_fusion_cfg (dict | None) – Config for level fusion layers. Required keys are ‘in_channels’, ‘mid_channels’, ‘out_channels’. Default: None.
aux_head_cfg (dict | None) – Config for aux head layers. Required keys are ‘out_channels’. Default: None.
flow_type (str) – Flow type to combine the features. Options are ‘cascade’ and ‘parallel’. Default: ‘cascade’.
-
forward(x, target=None)[source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
losses¶
-
class
mmaction.models.losses.BCELossWithLogits(loss_weight=1.0, class_weight=None)[source]¶ Binary Cross Entropy Loss with logits.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
-
class
mmaction.models.losses.BMNLoss[source]¶ BMN Loss.
From paper https://arxiv.org/abs/1907.09702, code https://github.com/JJBOY/BMN-Boundary-Matching-Network. It calculates the loss for the BMN model. This loss is a weighted sum of:
1) temporal evaluation loss based on confidence score of start and end positions;
2) proposal evaluation regression loss based on confidence scores of candidate proposals;
3) proposal evaluation classification loss based on classification results of candidate proposals.
-
forward(pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, bm_mask, weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0)[source]¶ Calculate Boundary Matching Network Loss.
- Parameters
pred_bm (torch.Tensor) – Predicted confidence score for boundary matching map.
pred_start (torch.Tensor) – Predicted confidence score for start.
pred_end (torch.Tensor) – Predicted confidence score for end.
gt_iou_map (torch.Tensor) – Groundtruth score for boundary matching map.
gt_start (torch.Tensor) – Groundtruth temporal_iou score for start.
gt_end (torch.Tensor) – Groundtruth temporal_iou score for end.
bm_mask (torch.Tensor) – Boundary-Matching mask.
weight_tem (float) – Weight for tem loss. Default: 1.0.
weight_pem_reg (float) – Weight for pem regression loss. Default: 10.0.
weight_pem_cls (float) – Weight for pem classification loss. Default: 1.0.
- Returns
(loss, tem_loss, pem_reg_loss, pem_cls_loss). Loss is the bmn loss, tem_loss is the temporal evaluation loss, pem_reg_loss is the proposal evaluation regression loss, pem_cls_loss is the proposal evaluation classification loss.
- Return type
tuple([torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor])
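The weighted sum described for BMNLoss can be sketched with scalar stand-ins for the three component losses. The function name `bmn_total_loss` is hypothetical; the default weights mirror those in the forward signature above.

```python
def bmn_total_loss(tem_loss, pem_reg_loss, pem_cls_loss,
                   weight_tem=1.0, weight_pem_reg=10.0, weight_pem_cls=1.0):
    """Combine the three BMN component losses into one weighted sum."""
    return (weight_tem * tem_loss
            + weight_pem_reg * pem_reg_loss
            + weight_pem_cls * pem_cls_loss)


# Scalar stand-ins for the tensors returned by tem_loss, pem_reg_loss
# and pem_cls_loss.
print(bmn_total_loss(tem_loss=0.5, pem_reg_loss=0.02, pem_cls_loss=0.3))
```

With the default weights, the regression term is scaled ten times more than the other two terms.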
-
static
pem_cls_loss(pred_score, gt_iou_map, mask, threshold=0.9, ratio_range=(1.05, 21), eps=1e-05)[source]¶ Calculate Proposal Evaluation Module Classification Loss.
- Parameters
pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
threshold (float) – Threshold of temporal_iou for positive instances. Default: 0.9.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)
eps (float) – Epsilon for small value. Default: 1e-5
- Returns
Proposal evaluation classification loss.
- Return type
torch.Tensor
-
static
pem_reg_loss(pred_score, gt_iou_map, mask, high_temporal_iou_threshold=0.7, low_temporal_iou_threshold=0.3)[source]¶ Calculate Proposal Evaluation Module Regression Loss.
- Parameters
pred_score (torch.Tensor) – Predicted temporal_iou score by BMN.
gt_iou_map (torch.Tensor) – Groundtruth temporal_iou score.
mask (torch.Tensor) – Boundary-Matching mask.
high_temporal_iou_threshold (float) – Higher threshold of temporal_iou. Default: 0.7.
low_temporal_iou_threshold (float) – Lower threshold of temporal_iou. Default: 0.3.
- Returns
Proposal evaluation regression loss.
- Return type
torch.Tensor
-
static
tem_loss(pred_start, pred_end, gt_start, gt_end)[source]¶ Calculate Temporal Evaluation Module Loss.
This function calculates the binary_logistic_regression_loss for start and end respectively and returns the sum of their losses.
- Parameters
pred_start (torch.Tensor) – Predicted start score by BMN model.
pred_end (torch.Tensor) – Predicted end score by BMN model.
gt_start (torch.Tensor) – Groundtruth confidence score for start.
gt_end (torch.Tensor) – Groundtruth confidence score for end.
- Returns
Returned binary logistic loss.
- Return type
torch.Tensor
-
-
class
mmaction.models.losses.BaseWeightedLoss(loss_weight=1.0)[source]¶ Base class for loss.
All subclasses should override the _forward() method, which returns the normal loss without loss weights.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
-
class
mmaction.models.losses.BinaryLogisticRegressionLoss[source]¶ Binary Logistic Regression Loss.
It will calculate binary logistic regression loss given reg_score and label.
-
forward(reg_score, label, threshold=0.5, ratio_range=(1.05, 21), eps=1e-05)[source]¶ Calculate Binary Logistic Regression Loss.
- Parameters
reg_score (torch.Tensor) – Predicted score by model.
label (torch.Tensor) – Groundtruth labels.
threshold (float) – Threshold for positive instances. Default: 0.5.
ratio_range (tuple) – Lower bound and upper bound for ratio. Default: (1.05, 21)
eps (float) – Epsilon for small value. Default: 1e-5.
- Returns
Returned binary logistic loss.
- Return type
torch.Tensor
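The threshold, ratio_range and eps parameters fit a class-balanced binary logistic loss of the kind used in BSN/BMN-style code. The sketch below is one such formulation in plain Python; that it matches the library's implementation term for term is an assumption.

```python
import math


def binary_logistic_regression_loss(reg_score, label, threshold=0.5,
                                    ratio_range=(1.05, 21), eps=1e-5):
    """Class-balanced binary logistic loss over flat lists of scores."""
    # Entries whose label exceeds the threshold count as positives.
    pmask = [1.0 if y > threshold else 0.0 for y in label]
    num_positive = max(sum(pmask), 1.0)
    # Ratio of all entries to positives, clipped to ratio_range.
    ratio = len(label) / num_positive
    ratio = min(max(ratio, ratio_range[0]), ratio_range[1])
    coef_neg = 0.5 * ratio / (ratio - 1.0)
    coef_pos = 0.5 * ratio
    total = 0.0
    for p, m in zip(reg_score, pmask):
        total += coef_pos * m * math.log(p + eps)
        total += coef_neg * (1.0 - m) * math.log(1.0 - p + eps)
    return -total / len(label)


good = binary_logistic_regression_loss([0.9, 0.1, 0.1, 0.1], [1, 0, 0, 0])
bad = binary_logistic_regression_loss([0.1, 0.9, 0.9, 0.9], [1, 0, 0, 0])
print(good < bad)
```

Clipping the ratio to at least 1.05 keeps the denominator `ratio - 1` away from zero when nearly every entry is positive.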
-
-
class
mmaction.models.losses.CrossEntropyLoss(loss_weight=1.0, class_weight=None)[source]¶ Cross Entropy Loss.
Supports two kinds of labels and their corresponding loss types. The loss type is detected from the shapes of cls_score and label.
1) Hard label: This label is an integer array and all of the elements are in the range [0, num_classes - 1]. This label’s shape should be cls_score’s shape with the num_classes dimension removed.
2) Soft label (probability distribution over classes): This label is a probability distribution and all of the elements are in the range [0, 1]. This label’s shape must be the same as cls_score. For now, only 2-dim soft label is supported.
- Parameters
loss_weight (float) – Factor scalar multiplied on the loss. Default: 1.0.
class_weight (list[float] | None) – Loss weight for each class. If set as None, use the same weight 1 for all classes. Only applies to CrossEntropyLoss and BCELossWithLogits (should not be set when using other losses). Default: None.
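The difference between hard and soft labels can be illustrated without tensors. In this sketch the two cases are told apart by Python type rather than by tensor shape, which is a simplification; `log_softmax` and `cross_entropy` are illustrative helpers, not the library's functions.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one score vector."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]


def cross_entropy(cls_score, label):
    """cls_score: [N][C] floats. label: [N] ints (hard) or [N][C] floats (soft)."""
    losses = []
    for row, target in zip(cls_score, label):
        logp = log_softmax(row)
        if isinstance(target, int):
            # Hard label: one class index per sample.
            losses.append(-logp[target])
        else:
            # Soft label: a distribution with the same length as the scores.
            losses.append(-sum(t * lp for t, lp in zip(target, logp)))
    return sum(losses) / len(losses)


scores = [[1.0, 2.0, 0.5]]
print(cross_entropy(scores, [1]))
print(cross_entropy(scores, [[0.0, 1.0, 0.0]]))
```

A one-hot soft label gives the same value as the corresponding hard label, so the hard case is a special case of the soft one.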
-
class
mmaction.models.losses.HVULoss(categories=('action', 'attribute', 'concept', 'event', 'object', 'scene'), category_nums=(739, 117, 291, 69, 1678, 248), category_loss_weights=(1, 1, 1, 1, 1, 1), loss_type='all', with_mask=False, reduction='mean', loss_weight=1.0)[source]¶ Calculate the BCELoss for HVU.
- Parameters
categories (tuple[str]) – Names of tag categories, tags are organized in this order. Default: (‘action’, ‘attribute’, ‘concept’, ‘event’, ‘object’, ‘scene’).
category_nums (tuple[int]) – Number of tags for each category. Default: (739, 117, 291, 69, 1678, 248).
category_loss_weights (tuple[float]) – Loss weights of categories, it applies only if loss_type == ‘individual’. The loss weights will be normalized so that the sum equals to 1, so that you can give any positive number as loss weight. Default: (1, 1, 1, 1, 1, 1).
loss_type (str) – The loss type we calculate, we can either calculate the BCELoss for all tags, or calculate the BCELoss for tags in each category. Choices are ‘individual’ or ‘all’. Default: ‘all’.
with_mask (bool) – Some tag categories are missing for some video clips. If with_mask == True, no loss is calculated for these missing categories. Otherwise, they are treated as negative samples. Default: False.
reduction (str) – Reduction way. Choices are ‘mean’ or ‘sum’. Default: ‘mean’.
loss_weight (float) – The loss weight. Default: 1.0.
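The normalization described for category_loss_weights amounts to dividing each weight by their sum. A minimal sketch follows, with hypothetical helper names and made-up per-category loss values; how the library combines the per-category losses beyond this normalization is an assumption.

```python
def normalize_weights(weights):
    """Scale positive weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def weighted_category_loss(category_losses, weights):
    """Weighted sum of per-category losses using normalized weights."""
    norm = normalize_weights(weights)
    return sum(w * loss for w, loss in zip(norm, category_losses))


# Equal weights reduce to a plain mean over categories.
print(normalize_weights((1, 1, 1, 1, 1, 1)))
print(weighted_category_loss([0.6, 0.2], weights=(3, 1)))
```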
-
class
mmaction.models.losses.NLLLoss(loss_weight=1.0)[source]¶ NLL Loss.
It will calculate NLL loss given cls_score and label.
-
class
mmaction.models.losses.OHEMHingeLoss[source]¶ This class is the core implementation for the completeness loss in paper.
It computes class-wise hinge loss and performs online hard example mining (OHEM).
-
static
backward(ctx, grad_output)[source]¶ Defines a formula for differentiating the operation.
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by as many outputs as forward() returned, and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input.
The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.
-
static
forward(ctx, pred, labels, is_positive, ohem_ratio, group_size)[source]¶ Calculate OHEM hinge loss.
- Parameters
pred (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
is_positive (int) – Set to 1 when proposals are positive and set to -1 when proposals are incomplete.
ohem_ratio (float) – Ratio of hard examples.
group_size (int) – Number of proposals sampled per video.
- Returns
Returned class-wise hinge loss.
- Return type
torch.Tensor
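The combination of hinge loss and online hard example mining can be sketched as follows. This is a simplified illustration, assuming each proposal has already been reduced to the score of its ground-truth class and that the hardest `int(group_size * ohem_ratio)` proposals per video are kept; both points are assumptions about the mechanics rather than a copy of the library code.

```python
def ohem_hinge_loss(scores, is_positive, ohem_ratio, group_size):
    """Sum the largest hinge losses within each group of proposals.

    scores: per-proposal completeness score for its ground-truth class.
    is_positive: 1 for positive proposals, -1 for incomplete ones.
    """
    losses = [max(0.0, 1.0 - is_positive * s) for s in scores]
    keep = int(group_size * ohem_ratio)
    total = 0.0
    for start in range(0, len(losses), group_size):
        # Hardest examples are those with the largest loss.
        group = sorted(losses[start:start + group_size], reverse=True)
        total += sum(group[:keep])
    return total


# One video with four positive proposals; keep the hardest half.
print(ohem_hinge_loss([0.2, 0.9, -0.5, 1.5], is_positive=1,
                      ohem_ratio=0.5, group_size=4))
```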
-
-
class
mmaction.models.losses.SSNLoss[source]¶ -
static
activity_loss(activity_score, labels, activity_indexer)[source]¶ Activity Loss.
It will calculate activity loss given activity_score and label.
- Parameters
activity_score (torch.Tensor) – Predicted activity score.
labels (torch.Tensor) – Groundtruth class label.
activity_indexer (torch.Tensor) – Index slices of proposals.
- Returns
Returned cross entropy loss.
- Return type
torch.Tensor
-
static
classwise_regression_loss(bbox_pred, labels, bbox_targets, regression_indexer)[source]¶ Classwise Regression Loss.
It will calculate classwise_regression loss given class_reg_pred and targets.
- Parameters
bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
regression_indexer (torch.Tensor) – Index slices of positive proposals.
- Returns
Returned class-wise regression loss.
- Return type
torch.Tensor
-
static
completeness_loss(completeness_score, labels, completeness_indexer, positive_per_video, incomplete_per_video, ohem_ratio=0.17)[source]¶ Completeness Loss.
It will calculate completeness loss given completeness_score and label.
- Parameters
completeness_score (torch.Tensor) – Predicted completeness score.
labels (torch.Tensor) – Groundtruth class label.
completeness_indexer (torch.Tensor) – Index slices of positive and incomplete proposals.
positive_per_video (int) – Number of positive proposals sampled per video.
incomplete_per_video (int) – Number of incomplete proposals sampled per video.
ohem_ratio (float) – Ratio of online hard example mining. Default: 0.17.
- Returns
Returned class-wise completeness loss.
- Return type
torch.Tensor
-
forward(activity_score, completeness_score, bbox_pred, proposal_type, labels, bbox_targets, train_cfg)[source]¶ Calculate SSN Loss.
- Parameters
activity_score (torch.Tensor) – Predicted activity score.
completeness_score (torch.Tensor) – Predicted completeness score.
bbox_pred (torch.Tensor) – Predicted interval center and span of positive proposals.
proposal_type (torch.Tensor) – Type index slices of proposals.
labels (torch.Tensor) – Groundtruth class label.
bbox_targets (torch.Tensor) – Groundtruth center and span of positive proposals.
train_cfg (dict) – Config for training.
- Returns
(loss_activity, loss_completeness, loss_reg). Loss_activity is the activity loss, loss_completeness is the class-wise completeness loss, loss_reg is the class-wise regression loss.
- Return type
dict([torch.Tensor, torch.Tensor, torch.Tensor])
-
mmaction.datasets¶
datasets¶
-
class
mmaction.datasets.AVADataset(ann_file, exclude_file, pipeline, label_file=None, filename_tmpl='img_{:05}.jpg', proposal_file=None, person_det_score_thr=0.9, num_classes=81, custom_classes=None, data_prefix=None, test_mode=False, modality='RGB', num_max_proposals=1000, timestamp_start=900, timestamp_end=1800)[source]¶ AVA dataset for spatial temporal detection.
Based on official AVA annotation files, the dataset loads raw frames, bounding boxes, proposals and applies specified transformations to return a dict containing the frame tensors and other information.
This dataset can load information from the following files:
ann_file -> ava_{train, val}_{v2.1, v2.2}.csv
exclude_file -> ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv
label_file -> ava_action_list_{v2.1, v2.2}.pbtxt / ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt
proposal_file -> ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl
Particularly, the proposal_file is a pickle file which contains img_key (in format of {video_id},{timestamp}). Example of a pickle file:
{
    ...
    '0f39OWEqJ24,0902': array([[0.011, 0.157, 0.655, 0.983, 0.998163]]),
    '0f39OWEqJ24,0912': array([[0.054, 0.088, 0.91, 0.998, 0.068273],
                               [0.016, 0.161, 0.519, 0.974, 0.984025],
                               [0.493, 0.283, 0.981, 0.984, 0.983621]]),
    ...
}
- Parameters
ann_file (str) – Path to the annotation file like ava_{train, val}_{v2.1, v2.2}.csv.
exclude_file (str) – Path to the excluded timestamp file like ava_{train, val}_excluded_timestamps_{v2.1, v2.2}.csv.
pipeline (list[dict | callable]) – A sequence of data transforms.
label_file (str) – Path to the label file like ava_action_list_{v2.1, v2.2}.pbtxt or ava_action_list_{v2.1, v2.2}_for_activitynet_2019.pbtxt. Default: None.
filename_tmpl (str) – Template for each filename. Default: ‘img_{:05}.jpg’.
proposal_file (str) – Path to the proposal file like ava_dense_proposals_{train, val}.FAIR.recall_93.9.pkl. Default: None.
person_det_score_thr (float) – The threshold of person detection scores, bboxes with scores above the threshold will be used. Default: 0.9. Note that 0 <= person_det_score_thr <= 1. If no proposal has detection score larger than the threshold, the one with the largest detection score will be used.
num_classes (int) – The number of classes of the dataset. Default: 81. (AVA has 80 action classes, another 1-dim is added for potential usage)
custom_classes (list[int]) – A subset of class ids from origin dataset. Please note that 0 should NOT be selected, and num_classes should be equal to len(custom_classes) + 1.
data_prefix (str) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.
num_max_proposals (int) – Max proposals number to store. Default: 1000.
timestamp_start (int) – The start point of included timestamps. The default value is referred from the official website. Default: 900.
timestamp_end (int) – The end point of included timestamps. The default value is referred from the official website. Default: 1800.
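The fallback rule described for person_det_score_thr can be sketched in plain Python using the rows from the pickle example above. Treating the fifth column as the detection score and keeping scores equal to the threshold are assumptions of this sketch; `select_proposals` is a hypothetical helper.

```python
def select_proposals(proposals, person_det_score_thr=0.9):
    """Keep boxes whose detection score reaches the threshold.

    proposals: rows of four box coordinates followed by a score.
    If no row qualifies, fall back to the single highest-scoring row.
    """
    kept = [p for p in proposals if p[4] >= person_det_score_thr]
    if not kept:
        kept = [max(proposals, key=lambda p: p[4])]
    # Drop the score column and return only the box coordinates.
    return [p[:4] for p in kept]


rows = [
    [0.054, 0.088, 0.91, 0.998, 0.068273],
    [0.016, 0.161, 0.519, 0.974, 0.984025],
    [0.493, 0.283, 0.981, 0.984, 0.983621],
]
print(select_proposals(rows))                              # default threshold
print(select_proposals(rows, person_det_score_thr=0.99))   # fallback case
```

The fallback guarantees that every img_key keeps at least one box, even when the threshold is set higher than all of its scores.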
-
evaluate(results, metrics=('mAP'), metric_options=None, logger=None)[source]¶ Perform evaluation for common datasets.
- Parameters
results (list) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘top_k_accuracy’.
metric_options (dict) – Dict for metric options. Options are topk for top_k_accuracy. Default: dict(top_k_accuracy=dict(topk=(1, 5))).
logger (logging.Logger | None) – Logger for recording. Default: None.
deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.
- Returns
Evaluation results dict.
- Return type
dict
-
class
mmaction.datasets.ActivityNetDataset(ann_file, pipeline, data_prefix=None, test_mode=False)[source]¶ ActivityNet dataset for temporal action localization.
The dataset loads raw features and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a json file with multiple objects, and each object has a key of the name of a video, and value of total frames of the video, total seconds of the video, annotations of a video, feature frames (frames covered by features) of the video, fps and rfps. Example of an annotation file:
{
    "v_--1DO2V4K74": {
        "duration_second": 211.53,
        "duration_frame": 6337,
        "annotations": [
            {
                "segment": [30.025882995319815, 205.2318595943838],
                "label": "Rock climbing"
            }
        ],
        "feature_frame": 6336,
        "fps": 30.0,
        "rfps": 29.9579255898
    },
    "v_--6bJUbfpnQ": {
        "duration_second": 26.75,
        "duration_frame": 647,
        "annotations": [
            {
                "segment": [2.578755070202808, 24.914101404056165],
                "label": "Drinking beer"
            }
        ],
        "feature_frame": 624,
        "fps": 24.0,
        "rfps": 24.1869158879
    },
    ...
}
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
data_prefix (str | None) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
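To make the annotation layout concrete, the following sketch reads one entry in the format described above and flattens it into rows. The helper name list_segments is hypothetical and the snippet uses only the standard json module; it does not reproduce how ActivityNetDataset itself loads the file.

```python
import json

# One entry copied from the annotation example above.
ann_text = """
{
  "v_--6bJUbfpnQ": {
    "duration_second": 26.75,
    "duration_frame": 647,
    "annotations": [
      {"segment": [2.578755070202808, 24.914101404056165],
       "label": "Drinking beer"}
    ],
    "feature_frame": 624,
    "fps": 24.0,
    "rfps": 24.1869158879
  }
}
"""


def list_segments(annotations):
    """Flatten per-video objects into (video, label, start_s, end_s) rows."""
    rows = []
    for video_name, info in annotations.items():
        for ann in info["annotations"]:
            start, end = ann["segment"]  # segment bounds in seconds
            rows.append((video_name, ann["label"], start, end))
    return rows


rows = list_segments(json.loads(ann_text))
```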
-
dump_results(results, out, output_format, version='VERSION 1.3')[source]¶ Dump data to json/csv files.
-
evaluate(results, metrics='AR@AN', metric_options={'AR@AN': {'max_avg_proposals': 100, 'temporal_iou_thresholds': array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95])}}, logger=None, **deprecated_kwargs)[source]¶ Evaluation in feature dataset.
- Parameters
results (list[dict]) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘AR@AN’.
metric_options (dict) – Dict for metric options. Options are max_avg_proposals, temporal_iou_thresholds for AR@AN. Default: {'AR@AN': dict(max_avg_proposals=100, temporal_iou_thresholds=np.linspace(0.5, 0.95, 10))}.
logger (logging.Logger | None) – Training logger. Defaults: None.
deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.
- Returns
Evaluation results for evaluation metrics.
- Return type
dict
-
static
proposals2json(results, show_progress=False)[source]¶ Convert all proposals to a final dict(json) format.
- Parameters
results (list[dict]) – All proposals.
show_progress (bool) – Whether to show the progress bar. Defaults: False.
- Returns
The final result dict. E.g.
dict(video-1=[dict(segment=[1.1, 2.0], score=0.9), dict(segment=[50.1, 129.3], score=0.6)])
- Return type
dict
-
class
mmaction.datasets.AudioDataset(ann_file, pipeline, suffix='.wav', **kwargs)[source]¶ Audio dataset for video recognition. Extracts the audio feature on-the-fly. Annotation file can be that of the rawframe dataset, or:
some/directory-1.wav 163 1
some/directory-2.wav 122 1
some/directory-3.wav 258 2
some/directory-4.wav 234 2
some/directory-5.wav 295 3
some/directory-6.wav 121 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
suffix (str) – The suffix of the audio file. Default: ‘.wav’.
kwargs (dict) – Other keyword args for BaseDataset.
-
class
mmaction.datasets.AudioFeatureDataset(ann_file, pipeline, suffix='.npy', **kwargs)[source]¶ Audio feature dataset for video recognition. Reads the features extracted off-line. Annotation file can be that of the rawframe dataset, or:
some/directory-1.npy 163 1
some/directory-2.npy 122 1
some/directory-3.npy 258 2
some/directory-4.npy 234 2
some/directory-5.npy 295 3
some/directory-6.npy 121 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
suffix (str) – The suffix of the audio feature file. Default: ‘.npy’.
kwargs (dict) – Other keyword args for BaseDataset.
-
class
mmaction.datasets.AudioVisualDataset(ann_file, pipeline, audio_prefix, **kwargs)[source]¶ Dataset that reads both audio and visual data, supporting both rawframes and videos. The annotation file is the same as that of the rawframe dataset, such as:
some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
audio_prefix (str) – Directory of the audio files.
kwargs (dict) – Other keyword args for RawframeDataset. video_prefix is also allowed if pipeline is designed for videos.
-
class
mmaction.datasets.BaseDataset(ann_file, pipeline, data_prefix=None, test_mode=False, multi_class=False, num_classes=None, start_index=1, modality='RGB', sample_by_class=False, power=None)[source]¶ Base class for datasets.
All datasets to process video should subclass it. All subclasses should overwrite:
- Methods: load_annotations, supporting to load information from an annotation file.
- Methods: prepare_train_frames, providing train data.
- Methods: prepare_test_frames, providing test data.
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
data_prefix (str | None) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
multi_class (bool) – Determines whether the dataset is a multi-class dataset. Default: False.
num_classes (int | None) – Number of classes of the dataset, used in multi-class datasets. Default: None.
start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Default: 1.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’, ‘Audio’. Default: ‘RGB’.
sample_by_class (bool) – Sampling by class, should be set True when performing inter-class data balancing. Only compatible with multi_class == False. Only applies for training. Default: False.
power (float | None) – We support sampling data with the probability proportional to the power of its label frequency (freq ^ power) when sampling data. power == 1 indicates uniformly sampling all data; power == 0 indicates uniformly sampling all classes. Default: None.
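The effect of power can be checked with a few lines of arithmetic. The sketch below assumes a two-stage scheme (draw a class with probability proportional to freq ^ power, then draw a sample uniformly inside that class); the function name class_sampling_probs is hypothetical and this is not the library's sampler.

```python
from collections import Counter


def class_sampling_probs(labels, power):
    """Probability of drawing each class, proportional to freq ** power."""
    counts = Counter(labels)
    weights = {cls: n ** power for cls, n in counts.items()}
    total = sum(weights.values())
    return {cls: w / total for cls, w in weights.items()}


labels = [0, 1, 1, 1]  # class 0 has one sample, class 1 has three

# power == 1: classes are drawn in proportion to their size, so every
# individual sample ends up equally likely (uniform over data).
by_data = class_sampling_probs(labels, power=1)

# power == 0: every class is equally likely, regardless of its size.
by_class = class_sampling_probs(labels, power=0)
```

Values between 0 and 1 interpolate between these two extremes.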
-
evaluate(results, metrics='top_k_accuracy', metric_options={'top_k_accuracy': {'topk': (1, 5)}}, logger=None, **deprecated_kwargs)[source]¶ Perform evaluation for common datasets.
- Parameters
results (list) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘top_k_accuracy’.
metric_options (dict) – Dict for metric options. Options are topk for top_k_accuracy. Default: dict(top_k_accuracy=dict(topk=(1, 5))).
logger (logging.Logger | None) – Logger for recording. Default: None.
deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.
- Returns
Evaluation results dict.
- Return type
dict
-
class
mmaction.datasets.CutmixBlending(num_classes, alpha=0.2)[source]¶ Implementing Cutmix in a mini-batch.
This module is proposed in CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. Code Reference https://github.com/clovaai/CutMix-PyTorch
- Parameters
num_classes (int) – The number of classes.
alpha (float) – Parameters for Beta distribution.
-
class
mmaction.datasets.HVUDataset(ann_file, pipeline, tag_categories, tag_category_nums, filename_tmpl=None, **kwargs)[source]¶ HVU dataset, which supports the recognition tags of multiple categories. Accepts either video annotation files or rawframe annotation files.
The dataset loads videos or raw frames and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a json file with multiple dictionaries, and each dictionary indicates a sample video with the filename and tags, the tags are organized as different categories. Example of a video dictionary:
{
    'filename': 'gD_G1b0wV5I_001015_001035.mp4',
    'label': {
        'concept': [250, 131, 42, 51, 57, 155, 122],
        'object': [1570, 508],
        'event': [16],
        'action': [180],
        'scene': [206]
    }
}
Example of a rawframe dictionary:
{
    'frame_dir': 'gD_G1b0wV5I_001015_001035',
    'total_frames': 61,
    'label': {
        'concept': [250, 131, 42, 51, 57, 155, 122],
        'object': [1570, 508],
        'event': [16],
        'action': [180],
        'scene': [206]
    }
}
- Parameters
ann_file (str) – Path to the annotation file, should be a json file.
pipeline (list[dict | callable]) – A sequence of data transforms.
tag_categories (list[str]) – List of category names of tags.
tag_category_nums (list[int]) – List of number of tags in each category.
filename_tmpl (str | None) – Template for each filename. If set to None, video dataset is used. Default: None.
**kwargs – Keyword arguments for BaseDataset.
-
evaluate(results, metrics='mean_average_precision', metric_options=None, logger=None)[source]¶ Evaluation in HVU Video Dataset. We only support evaluating mAP for each tag category. Since some tag categories are missing for some videos, we cannot evaluate mAP for all tags.
- Parameters
results (list) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘mean_average_precision’.
metric_options (dict | None) – Dict for metric options. Default: None.
logger (logging.Logger | None) – Logger for recording. Default: None.
- Returns
Evaluation results dict.
- Return type
dict
-
class
mmaction.datasets.ImageDataset(ann_file, pipeline, **kwargs)[source]¶ Image dataset for action recognition, used in the Project OmniSource.
The dataset loads an image list and applies specified transforms to return a dict containing the image tensors and other information.
For the ImageDataset, the ann_file is a text file with multiple lines, and each line indicates the image path and the image label, which are split with a whitespace. Example of an annotation file:
path/to/image1.jpg 1
path/to/image2.jpg 1
path/to/image3.jpg 2
path/to/image4.jpg 2
path/to/image5.jpg 3
path/to/image6.jpg 3
Example of a multi-class annotation file:
path/to/image1.jpg 1 3 5
path/to/image2.jpg 1 2
path/to/image3.jpg 2
path/to/image4.jpg 2 4 6 8
path/to/image5.jpg 3
path/to/image6.jpg 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
**kwargs – Keyword arguments for BaseDataset.
-
class
mmaction.datasets.MixupBlending(num_classes, alpha=0.2)[source]¶ Implementing Mixup in a mini-batch.
This module is proposed in mixup: Beyond Empirical Risk Minimization. Code Reference https://github.com/open-mmlab/mmclassification/blob/master/mmcls/models/utils/mixup.py
- Parameters
num_classes (int) – The number of classes.
alpha (float) – Parameters for Beta distribution.
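The blending idea can be sketched in pure Python. This is a toy illustration, not the MixupBlending module: the function name mixup is hypothetical, clips are flattened lists rather than tensors, and drawing the mixing weight from Beta(alpha, alpha) is an assumption about how alpha is used.

```python
import random


def mixup(clips, one_hot_labels, lam, perm):
    """Blend sample i with sample perm[i], using weight lam for sample i."""
    def blend(a, b):
        return [lam * x + (1 - lam) * y for x, y in zip(a, b)]

    mixed_clips = [blend(clips[i], clips[j]) for i, j in enumerate(perm)]
    mixed_labels = [blend(one_hot_labels[i], one_hot_labels[j])
                    for i, j in enumerate(perm)]
    return mixed_clips, mixed_labels


clips = [[0.0, 0.0], [1.0, 1.0]]   # two flattened toy "clips"
labels = [[1.0, 0.0], [0.0, 1.0]]  # one-hot labels for 2 classes

# In training the weight would be random, e.g. drawn from Beta(alpha, alpha).
sampled_lam = random.betavariate(0.2, 0.2)

# A fixed weight and permutation keep this example deterministic.
mixed_clips, mixed_labels = mixup(clips, labels, lam=0.75, perm=[1, 0])
```

Because the labels are blended with the same weight as the inputs, each mixed label remains a valid probability distribution over num_classes.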
-
class
mmaction.datasets.RawVideoDataset(ann_file, pipeline, clipname_tmpl='part_{}.mp4', sampling_strategy='positive', **kwargs)[source]¶ RawVideo dataset for action recognition, used in the Project OmniSource.
The dataset loads clips of raw videos and applies specified transforms to return a dict containing the frame tensors and other information. Note that for this dataset, multi_class should be False.
The ann_file is a text file with multiple lines, and each line indicates a sample video with the filepath (without suffix), label, number of clips and index of positive clips (starting from 0), which are split with a whitespace. Raw videos should be first trimmed into 10 second clips, organized in the following format:
some/path/D32_1gwq35E/part_0.mp4
some/path/D32_1gwq35E/part_1.mp4
......
some/path/D32_1gwq35E/part_n.mp4
Example of an annotation file:
some/path/D32_1gwq35E 66 10 0 1 2
some/path/-G-5CJ0JkKY 254 5 3 4
some/path/T4h1bvOd9DA 33 1 0
some/path/4uZ27ivBl00 341 2 0 1
some/path/0LfESFkfBSw 186 234 7 9 11
some/path/-YIsNpBEx6c 169 100 9 10 11
The first line indicates that the raw video some/path/D32_1gwq35E has action label 66, consists of 10 clips (from part_0.mp4 to part_9.mp4). The 1st, 2nd and 3rd clips are positive clips.
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
sampling_strategy (str) – The strategy to sample clips from raw videos. Choices are ‘random’ or ‘positive’. Default: ‘positive’.
clipname_tmpl (str) – The template of clip name in the raw video. Default: ‘part_{}.mp4’.
**kwargs – Keyword arguments for BaseDataset.
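As an illustration of the annotation format, the sketch below splits one line into its fields and expands the positive clip indices with clipname_tmpl. The helper name parse_raw_video_line and the keys of the returned dict are hypothetical; they are not taken from RawVideoDataset.

```python
def parse_raw_video_line(line, clipname_tmpl="part_{}.mp4"):
    """Parse 'path label num_clips pos_idx [pos_idx ...]'."""
    parts = line.split()
    path = parts[0]
    label = int(parts[1])
    num_clips = int(parts[2])
    positive_inds = [int(i) for i in parts[3:]]  # 0-based clip indices
    return {
        "path": path,
        "label": label,
        "num_clips": num_clips,
        "positive_clips": [
            f"{path}/{clipname_tmpl.format(i)}" for i in positive_inds
        ],
    }


info = parse_raw_video_line("some/path/D32_1gwq35E 66 10 0 1 2")
```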
-
class
mmaction.datasets.RawframeDataset(ann_file, pipeline, data_prefix=None, test_mode=False, filename_tmpl='img_{:05}.jpg', with_offset=False, multi_class=False, num_classes=None, start_index=1, modality='RGB', sample_by_class=False, power=None)[source]¶ Rawframe dataset for action recognition.
The dataset loads raw frames and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a text file with multiple lines, and each line indicates the directory to frames of a video, total frames of the video and the label of a video, which are split with a whitespace. Example of an annotation file:
some/directory-1 163 1
some/directory-2 122 1
some/directory-3 258 2
some/directory-4 234 2
some/directory-5 295 3
some/directory-6 121 3
Example of a multi-class annotation file:
some/directory-1 163 1 3 5
some/directory-2 122 1 2
some/directory-3 258 2
some/directory-4 234 2 4 6 8
some/directory-5 295 3
some/directory-6 121 3
Example of a with_offset annotation file (clips from long videos), each line indicates the directory to frames of a video, the index of the start frame, total frames of the video clip and the label of a video clip, which are split with a whitespace.
some/directory-1 12 163 3
some/directory-2 213 122 4
some/directory-3 100 258 5
some/directory-4 98 234 2
some/directory-5 0 295 3
some/directory-6 50 121 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
data_prefix (str | None) – Path to a directory where videos are held. Default: None.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
filename_tmpl (str) – Template for each filename. Default: ‘img_{:05}.jpg’.
with_offset (bool) – Determines whether the offset information is in ann_file. Default: False.
multi_class (bool) – Determines whether it is a multi-class recognition dataset. Default: False.
num_classes (int | None) – Number of classes in the dataset. Default: None.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.
sample_by_class (bool) – Sampling by class, should be set True when performing inter-class data balancing. Only compatible with multi_class == False. Only applies for training. Default: False.
power (float | None) – We support sampling data with the probability proportional to the power of its label frequency (freq ^ power) when sampling data. power == 1 indicates uniformly sampling all data; power == 0 indicates uniformly sampling all classes. Default: None.
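The three annotation layouts differ only in which fields follow the frame directory. A minimal parsing sketch, assuming whitespace-separated fields as described above (the function name parse_rawframe_line and the dict keys are illustrative, not the dataset's guaranteed internals):

```python
def parse_rawframe_line(line, with_offset=False, multi_class=False):
    """Parse one rawframe annotation line into a dict."""
    parts = line.split()
    info = {"frame_dir": parts[0]}
    idx = 1
    if with_offset:
        # Index of the start frame precedes the frame count.
        info["offset"] = int(parts[idx])
        idx += 1
    info["total_frames"] = int(parts[idx])
    idx += 1
    labels = [int(x) for x in parts[idx:]]
    # Multi-class lines keep every label; otherwise exactly one is expected.
    info["label"] = labels if multi_class else labels[0]
    return info


plain = parse_rawframe_line("some/directory-1 163 1")
multi = parse_rawframe_line("some/directory-4 234 2 4 6 8", multi_class=True)
clip = parse_rawframe_line("some/directory-1 12 163 3", with_offset=True)
```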
-
class
mmaction.datasets.RepeatDataset(dataset, times)[source]¶ A wrapper of repeated dataset.
The length of repeated dataset will be times larger than the original dataset. This is useful when the data loading time is long but the dataset is small. Using RepeatDataset can reduce the data loading time between epochs.
- Parameters
dataset (Dataset) – The dataset to be repeated.
times (int) – Repeat times.
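The wrapper idea amounts to mapping a longer index range back onto the original items. A minimal sketch with a plain list standing in for the dataset (the class name RepeatedView is hypothetical and is not the library class):

```python
class RepeatedView:
    """Expose `dataset` as if it were `times` copies laid end to end."""

    def __init__(self, dataset, times):
        self.dataset = dataset
        self.times = times
        self._ori_len = len(dataset)

    def __getitem__(self, idx):
        # Wrap the index back into the range of the original dataset.
        return self.dataset[idx % self._ori_len]

    def __len__(self):
        return self.times * self._ori_len


repeated = RepeatedView(["clip_a", "clip_b", "clip_c"], times=4)
```

One epoch over the wrapper visits every original item times times, so the per-epoch start-up cost of the data loader is paid less often.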
-
class
mmaction.datasets.SSNDataset(ann_file, pipeline, train_cfg, test_cfg, data_prefix, test_mode=False, filename_tmpl='img_{:05d}.jpg', start_index=1, modality='RGB', video_centric=True, reg_normalize_constants=None, body_segments=5, aug_segments=(2, 2), aug_ratio=(0.5, 0.5), clip_len=1, frame_interval=1, filter_gt=True, use_regression=True, verbose=False)[source]¶ Proposal frame dataset for Structured Segment Networks.
Based on proposal information, the dataset loads raw frames and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a text file with multiple lines and each video’s information takes up several lines. This file can be a normalized file with percent or standard file with specific frame indexes. If the file is a normalized file, it will be converted into a standard file first.
Template information of a video in a standard file:
# index
video_id
num_frames
fps
num_gts
label, start_frame, end_frame
label, start_frame, end_frame
…
num_proposals
label, best_iou, overlap_self, start_frame, end_frame
label, best_iou, overlap_self, start_frame, end_frame
…
Example of a standard annotation file:
# 0
video_validation_0000202
5666
1
3
8 130 185
8 832 1136
8 1303 1381
5
8 0.0620 0.0620 790 5671
8 0.1656 0.1656 790 2619
8 0.0833 0.0833 3945 5671
8 0.0960 0.0960 4173 5671
8 0.0614 0.0614 3327 5671
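A reader for the standard layout can be sketched as follows. It assumes each field of the template sits on its own line and that values within a line are whitespace-separated, as in the example; the function name parse_ssn_standard and the dict keys are hypothetical, and normalized files are not handled.

```python
def parse_ssn_standard(text):
    """Parse a standard SSN proposal file into a list of per-video dicts."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    videos, i = [], 0
    while i < len(lines):
        if not lines[i].startswith("#"):
            raise ValueError(f"expected '# index' at line {i}: {lines[i]!r}")
        video_id = lines[i + 1]
        num_frames = int(lines[i + 2])
        fps = float(lines[i + 3])
        num_gts = int(lines[i + 4])
        i += 5
        gts = []
        for _ in range(num_gts):
            label, start, end = lines[i].split()
            gts.append((int(label), int(start), int(end)))
            i += 1
        num_proposals = int(lines[i])
        i += 1
        proposals = []
        for _ in range(num_proposals):
            label, best_iou, overlap_self, start, end = lines[i].split()
            proposals.append((int(label), float(best_iou),
                              float(overlap_self), int(start), int(end)))
            i += 1
        videos.append(dict(video_id=video_id, num_frames=num_frames,
                           fps=fps, gts=gts, proposals=proposals))
    return videos


sample = """
# 0
video_validation_0000202
5666
1
3
8 130 185
8 832 1136
8 1303 1381
5
8 0.0620 0.0620 790 5671
8 0.1656 0.1656 790 2619
8 0.0833 0.0833 3945 5671
8 0.0960 0.0960 4173 5671
8 0.0614 0.0614 3327 5671
"""
videos = parse_ssn_standard(sample)
```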
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
train_cfg (dict) – Config for training.
test_cfg (dict) – Config for testing.
data_prefix (str) – Path to a directory where videos are held.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
filename_tmpl (str) – Template for each filename. Default: ‘img_{:05d}.jpg’.
start_index (int) – Specify a start index for frames in consideration of different filename format. Default: 1.
modality (str) – Modality of data. Support ‘RGB’, ‘Flow’. Default: ‘RGB’.
video_centric (bool) – Whether to sample proposals just from this video or sample proposals randomly from the entire dataset. Default: True.
reg_normalize_constants (list) – Regression target normalized constants, including mean and standard deviation of location and duration.
body_segments (int) – Number of segments in course period. Default: 5.
aug_segments (list[int]) – Number of segments in starting and ending period. Default: (2, 2).
aug_ratio (int | float | tuple[int | float]) – The ratio of the length of augmentation to that of the proposal. Default: (0.5, 0.5).
clip_len (int) – Frames of each sampled output clip. Default: 1.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.
filter_gt (bool) – Whether to filter videos with no annotation during training. Default: True.
use_regression (bool) – Whether to perform regression. Default: True.
verbose (bool) – Whether to print full information or not. Default: False.
-
construct_proposal_pools()[source]¶ Construct positive proposal pool, incomplete proposal pool and background proposal pool of the entire dataset.
-
evaluate(results, metrics='mAP', metric_options={'mAP': {'eval_dataset': 'thumos14'}}, logger=None, **deprecated_kwargs)[source]¶ Evaluation in SSN proposal dataset.
- Parameters
results (list[dict]) – Output results.
metrics (str | sequence[str]) – Metrics to be performed. Defaults: ‘mAP’.
metric_options (dict) – Dict for metric options. Options are eval_dataset for mAP. Default: dict(mAP=dict(eval_dataset='thumos14')).
logger (logging.Logger | None) – Logger for recording. Default: None.
deprecated_kwargs (dict) – Used for containing deprecated arguments. See ‘https://github.com/open-mmlab/mmaction2/pull/286’.
- Returns
Evaluation results for evaluation metrics.
- Return type
dict
-
static
get_negatives(proposals, incomplete_iou_threshold, background_iou_threshold, background_coverage_threshold=0.01, incomplete_overlap_threshold=0.7)[source]¶ Get negative proposals, including incomplete proposals and background proposals.
- Parameters
proposals (list) – List of proposal instances (SSNInstance).
incomplete_iou_threshold (float) – Maximum threshold of overlap of incomplete proposals and groundtruths.
background_iou_threshold (float) – Maximum threshold of overlap of background proposals and groundtruths.
background_coverage_threshold (float) – Minimum coverage of background proposals in video duration. Default: 0.01.
incomplete_overlap_threshold (float) – Minimum percent of incomplete proposals’ own span contained in a groundtruth instance. Default: 0.7.
- Returns
(incompletes, backgrounds), where incompletes and backgrounds are lists comprised of incomplete proposal instances and background proposal instances.
- Return type
list[SSNInstance]
-
static
get_positives(gts, proposals, positive_threshold, with_gt=True)[source]¶ Get positive/foreground proposals.
- Parameters
gts (list) – List of groundtruth instances (SSNInstance).
proposals (list) – List of proposal instances (SSNInstance).
positive_threshold (float) – Minimum threshold of overlap of positive/foreground proposals and groundtruths.
with_gt (bool) – Whether to include groundtruth instances in positive proposals. Default: True.
- Returns
(positives), where positives is a list comprised of positive proposal instances.
- Return type
list[SSNInstance]
-
class
mmaction.datasets.VideoDataset(ann_file, pipeline, start_index=0, **kwargs)[source]¶ Video dataset for action recognition.
The dataset loads raw videos and applies specified transforms to return a dict containing the frame tensors and other information.
The ann_file is a text file with multiple lines, and each line indicates a sample video with the filepath and label, which are split with a whitespace. Example of an annotation file:
some/path/000.mp4 1
some/path/001.mp4 1
some/path/002.mp4 2
some/path/003.mp4 2
some/path/004.mp4 3
some/path/005.mp4 3
- Parameters
ann_file (str) – Path to the annotation file.
pipeline (list[dict | callable]) – A sequence of data transforms.
start_index (int) – Specify a start index for frames in consideration of different filename format. However, when taking videos as input, it should be set to 0, since frames loaded from videos count from 0. Default: 0.
**kwargs – Keyword arguments for BaseDataset.
-
mmaction.datasets.build_dataloader(dataset, videos_per_gpu, workers_per_gpu, num_gpus=1, dist=True, shuffle=True, seed=None, drop_last=False, pin_memory=True, **kwargs)[source]¶ Build PyTorch DataLoader.
In distributed training, each GPU/process has a dataloader. In non-distributed training, there is only one dataloader for all GPUs.
- Parameters
dataset (Dataset) – A PyTorch dataset.
videos_per_gpu (int) – Number of videos on each GPU, i.e., batch size of each GPU.
workers_per_gpu (int) – How many subprocesses to use for data loading for each GPU.
num_gpus (int) – Number of GPUs. Only used in non-distributed training. Default: 1.
dist (bool) – Distributed training/test or not. Default: True.
shuffle (bool) – Whether to shuffle the data at every epoch. Default: True.
seed (int | None) – Seed to be used. Default: None.
drop_last (bool) – Whether to drop the last incomplete batch in epoch. Default: False.
pin_memory (bool) – Whether to use pin_memory in DataLoader. Default: True.
kwargs (dict, optional) – Any keyword argument to be used to initialize DataLoader.
- Returns
A PyTorch dataloader.
- Return type
DataLoader
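One way to read the two modes is in terms of the batch size and worker count handed to the underlying DataLoader. The sketch below encodes an assumption that is consistent with num_gpus being used only in non-distributed training, but is not stated explicitly in this reference: the single non-distributed loader serves all GPUs, so both values are scaled by num_gpus. The function name loader_sizes is hypothetical.

```python
def loader_sizes(videos_per_gpu, workers_per_gpu, num_gpus=1, dist=True):
    """Return (batch_size, num_workers) for one DataLoader instance."""
    if dist:
        # One loader per GPU/process: per-GPU values are used directly.
        return videos_per_gpu, workers_per_gpu
    # One loader for all GPUs (assumed): scale by the number of GPUs.
    return num_gpus * videos_per_gpu, num_gpus * workers_per_gpu


per_process = loader_sizes(8, 2, num_gpus=4, dist=True)
single_loader = loader_sizes(8, 2, num_gpus=4, dist=False)
```

Under this reading the total number of videos per iteration is the same in both modes (4 GPUs x 8 videos), only split across loaders differently.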
-
mmaction.datasets.build_dataset(cfg, default_args=None)[source]¶ Build a dataset from config dict.
- Parameters
cfg (dict) – Config dict. It should at least contain the key “type”.
default_args (dict | None, optional) – Default initialization arguments. Default: None.
- Returns
The constructed dataset.
- Return type
Dataset
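The config-driven construction can be sketched with a toy registry. Everything here is illustrative: TOY_DATASETS, register, ToyVideoDataset and build_from_cfg are hypothetical names written for this example, and letting cfg values take precedence over default_args is an assumption.

```python
TOY_DATASETS = {}  # maps a type name to a class (stand-in for a registry)


def register(cls):
    """Class decorator that records the class under its own name."""
    TOY_DATASETS[cls.__name__] = cls
    return cls


@register
class ToyVideoDataset:
    def __init__(self, ann_file, pipeline, test_mode=False):
        self.ann_file = ann_file
        self.pipeline = pipeline
        self.test_mode = test_mode


def build_from_cfg(cfg, default_args=None):
    """Instantiate the class named by cfg['type'] with the remaining keys."""
    args = dict(default_args or {})
    args.update(cfg)  # assumed: explicit cfg entries override defaults
    dataset_type = args.pop("type")
    return TOY_DATASETS[dataset_type](**args)


cfg = dict(type="ToyVideoDataset", ann_file="train_list.txt", pipeline=[])
dataset = build_from_cfg(cfg, default_args=dict(test_mode=True))
```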
pipelines¶
-
class
mmaction.datasets.pipelines.AudioAmplify(ratio)[source]¶ Amplify the waveform.
Required keys are “audios”, added or modified keys are “audios”, “amplify_ratio”.
- Parameters
ratio (float) – The ratio used to amplify the audio waveform.
-
class
mmaction.datasets.pipelines.AudioDecode(fixed_length=32000)[source]¶ Sample the audio w.r.t. the frames selected.
- Parameters
fixed_length (int) – As the audio clip selected by frames sampled may not be exactly the same, fixed_length will truncate or pad them into the same size. Default: 32000.
Required keys are “frame_inds”, “num_clips”, “total_frames”, “length”, added or modified keys are “audios”, “audios_shape”.
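The role of fixed_length can be shown on a plain list of samples. This sketch assumes truncation drops trailing samples and padding appends zeros at the end; the actual padding scheme is not specified above, and the function name fix_length is hypothetical.

```python
def fix_length(samples, fixed_length=32000, pad_value=0.0):
    """Truncate or pad a 1-D list of samples to exactly fixed_length."""
    if len(samples) >= fixed_length:
        return samples[:fixed_length]  # assumed: drop trailing samples
    # Assumed: pad at the end with a constant value.
    return samples + [pad_value] * (fixed_length - len(samples))


short = fix_length([0.1, 0.2, 0.3], fixed_length=5)
long = fix_length([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], fixed_length=5)
```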
-
class
mmaction.datasets.pipelines.AudioDecodeInit(io_backend='disk', sample_rate=16000, pad_method='zero', **kwargs)[source]¶ Using librosa to initialize the audio reader.
Required keys are “audio_path”, added or modified keys are “length”, “sample_rate”, “audios”.
- Parameters
io_backend (str) – io backend where frames are stored. Default: ‘disk’.
sample_rate (int) – Audio sampling times per second. Default: 16000.
-
class
mmaction.datasets.pipelines.AudioFeatureSelector(fixed_length=128)[source]¶ Sample the audio feature w.r.t. the frames selected.
Required keys are “audios”, “frame_inds”, “num_clips”, “length”, “total_frames”, added or modified keys are “audios”, “audios_shape”.
- Parameters
fixed_length (int) – As the features selected by frames sampled may not be exactly the same, fixed_length will truncate or pad them into the same size. Default: 128.
-
class
mmaction.datasets.pipelines.BuildPseudoClip(clip_len)[source]¶ Build pseudo clips with one single image by repeating it n times.
Required key is “imgs”, added or modified keys are “imgs”, “num_clips”, “clip_len”.
- Parameters
clip_len (int) – Frames of the generated pseudo clips.
-
class
mmaction.datasets.pipelines.CenterCrop(crop_size, lazy=False)[source]¶ Crop the center area from images.
Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox”, “lazy” and “img_shape”. Required keys in “lazy” is “crop_bbox”, added or modified key is “crop_bbox”.
- Parameters
crop_size (int | tuple[int]) – (w, h) of crop size.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
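The geometry of a center crop can be written out directly. The sketch assumes an integer crop_size means a square crop and that the box is expressed as (left, top, right, bottom) in pixels; neither detail is stated in the reference above, and the function name center_crop_bbox is hypothetical.

```python
def center_crop_bbox(img_w, img_h, crop_size):
    """Return (left, top, right, bottom) of a centered crop window."""
    if isinstance(crop_size, int):
        crop_size = (crop_size, crop_size)  # assumed: int means square
    crop_w, crop_h = crop_size
    left = (img_w - crop_w) // 2
    top = (img_h - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


square = center_crop_bbox(340, 256, 224)
wide = center_crop_bbox(340, 256, (320, 240))
```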
-
class
mmaction.datasets.pipelines.Collect(keys, meta_keys=('filename', 'label', 'original_shape', 'img_shape', 'pad_shape', 'flip_direction', 'img_norm_cfg'), meta_name='img_metas', nested=False)[source]¶ Collect data from the loader relevant to the specific task.
This keeps the items in keys as they are, and collects items in meta_keys into a meta item called meta_name. This is usually the last stage of the data loader pipeline. For example, when keys=’imgs’, meta_keys=(‘filename’, ‘label’, ‘original_shape’), meta_name=’img_metas’, the results will be a dict with keys ‘imgs’ and ‘img_metas’, where ‘img_metas’ is a DataContainer of another dict with keys ‘filename’, ‘label’, ‘original_shape’.
- Parameters
keys (Sequence[str]) – Required keys to be collected.
meta_name (str) – The name of the key that contains meta information. This key is always populated. Default: “img_metas”.
meta_keys (Sequence[str]) – Keys that are collected under meta_name. The contents of the meta_name dictionary depend on meta_keys. By default this includes:
- “filename”: path to the image file
- “label”: label of the image file
- “original_shape”: original shape of the image as a tuple (h, w, c)
- “img_shape”: shape of the image input to the network as a tuple (h, w, c). Note that images may be zero padded on the bottom/right, if the batch tensor is larger than this shape.
- “pad_shape”: image shape after padding
- “flip_direction”: a str in (“horizontal”, “vertical”) to indicate if the image is flipped horizontally or vertically.
- “img_norm_cfg”: a dict of normalization information:
  - mean: per channel mean subtraction
  - std: per channel std divisor
  - to_rgb: bool indicating if bgr was converted to rgb
nested (bool) – If set as True, will apply data[x] = [data[x]] to all items in data. The arg is added for compatibility. Default: False.
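The keys / meta_keys split can be sketched with plain dicts. This toy version omits the DataContainer wrapping and the nested option, and the function name collect is hypothetical; it only shows which entries of the pipeline results end up where.

```python
def collect(results, keys, meta_keys, meta_name="img_metas"):
    """Keep `keys` at the top level and gather `meta_keys` under meta_name."""
    data = {key: results[key] for key in keys}
    data[meta_name] = {key: results[key] for key in meta_keys}
    return data


results = {
    "imgs": [[0, 1], [2, 3]],          # placeholder for decoded frames
    "filename": "some/path/000.mp4",
    "label": 1,
    "original_shape": (256, 340),
    "frame_inds": [0, 4, 8],           # intermediate key, dropped below
}
out = collect(
    results,
    keys=["imgs"],
    meta_keys=("filename", "label", "original_shape"),
)
```

Keys that are listed in neither keys nor meta_keys, such as frame_inds here, do not appear in the output.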
-
class
mmaction.datasets.pipelines.ColorJitter(color_space_aug=False, alpha_std=0.1, eig_val=None, eig_vec=None)[source]¶ Randomly distort the brightness, contrast, saturation and hue of images, and add PCA based noise into images.
Note: The input images should be in RGB channel order.
Code Reference: https://gluon-cv.mxnet.io/_modules/gluoncv/data/transforms/experimental/image.html https://mxnet.apache.org/api/python/docs/_modules/mxnet/image/image.html#LightingAug
If specified to apply color space augmentation, it will distort the image color space by changing brightness, contrast and saturation. Then, it will add some random distort to the images in different color channels. Note that the input images should be in original range [0, 255] and in RGB channel sequence.
Required keys are “imgs”, added or modified keys are “imgs”, “eig_val”, “eig_vec”, “alpha_std” and “color_space_aug”.
- Parameters
color_space_aug (bool) – Whether to apply color space augmentations. If specified, it will change the brightness, contrast, saturation and hue of images, then add PCA based noise to images. Otherwise, it will directly add PCA based noise to images. Default: False.
alpha_std (float) – Std in the normal Gaussian distribution of alpha.
eig_val (np.ndarray | None) – Eigenvalues of [1 x 3] size for RGB channel jitter. If set to None, it will use the default eigenvalues. Default: None.
eig_vec (np.ndarray | None) – Eigenvectors of [3 x 3] size for RGB channel jitter. If set to None, it will use the default eigenvectors. Default: None.
-
static
brightness(img, delta)[source]¶ Brightness distortion.
- Parameters
img (np.ndarray) – An input image.
delta (float) – Delta value to distort brightness. It ranges from [-32, 32).
- Returns
A brightness distorted image.
- Return type
np.ndarray
-
static
contrast(img, alpha)[source]¶ Contrast distortion.
- Parameters
img (np.ndarray) – An input image.
alpha (float) – Alpha value to distort contrast. It ranges from [0.6, 1.4).
- Returns
A contrast distorted image.
- Return type
np.ndarray
-
class
mmaction.datasets.pipelines.Compose(transforms)[source]¶ Compose a data pipeline with a sequence of transforms.
- Parameters
transforms (list[dict | callable]) – Either config dicts of transforms or transform objects.
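Chaining transforms can be sketched with plain callables. The toy class below covers only transform objects; building transforms from config dicts needs a registry and is not shown. The names ToyCompose, add_total_frames and sample_every_other are hypothetical, and stopping early when a transform returns None is an assumption.

```python
class ToyCompose:
    """Apply a sequence of callables to a results dict, in order."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, results):
        for transform in self.transforms:
            results = transform(results)
            if results is None:  # assumed: a transform may reject a sample
                return None
        return results


def add_total_frames(results):
    results["total_frames"] = 8
    return results


def sample_every_other(results):
    results["frame_inds"] = list(range(0, results["total_frames"], 2))
    return results


pipeline = ToyCompose([add_total_frames, sample_every_other])
output = pipeline({"filename": "some/path/000.mp4"})
```

Each transform reads the keys written by earlier ones, which is why the "required keys" and "added or modified keys" listed for every pipeline class determine a valid ordering.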
-
class
mmaction.datasets.pipelines.DecordDecode[source]¶ Using decord to decode the video.
Decord: https://github.com/dmlc/decord
Required keys are “video_reader”, “filename” and “frame_inds”, added or modified keys are “imgs” and “original_shape”.
-
class
mmaction.datasets.pipelines.DecordInit(io_backend='disk', num_threads=1, **kwargs)[source]¶ Using decord to initialize the video_reader.
Decord: https://github.com/dmlc/decord
Required keys are “filename”, added or modified keys are “video_reader” and “total_frames”.
-
class
mmaction.datasets.pipelines.DenseSampleFrames(clip_len, frame_interval=1, num_clips=1, sample_range=64, num_sample_positions=10, temporal_jitter=False, out_of_bound_opt='loop', test_mode=False)[source]¶ Select frames from the video by dense sample strategy.
Required keys are “filename”, added or modified keys are “total_frames”, “frame_inds”, “frame_interval” and “num_clips”.
- Parameters
clip_len (int) – Frames of each sampled output clip.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.
num_clips (int) – Number of clips to be sampled. Default: 1.
sample_range (int) – Total sample range for dense sample. Default: 64.
num_sample_positions (int) – Number of sample start positions, which is only used in test mode. Default: 10. That is to say, by default, there are at least 10 clips for one input sample in test mode.
temporal_jitter (bool) – Whether to apply temporal jittering. Default: False.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
-
class
mmaction.datasets.pipelines.Flip(flip_ratio=0.5, direction='horizontal', flip_label_map=None, lazy=False)[source]¶ Flip the input images with a probability.
Reverse the order of elements in the given imgs along a specific direction. The shape of the imgs is preserved, but the elements are reordered. Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs”, “lazy” and “flip_direction”. No keys are required in “lazy”; added or modified keys are “flip” and “flip_direction”. The Flip augmentation should be placed after any cropping / reshaping augmentations, to make sure crop_quadruple is calculated properly.
- Parameters
flip_ratio (float) – Probability of implementing flip. Default: 0.5.
direction (str) – Flip imgs horizontally or vertically. Options are “horizontal” | “vertical”. Default: “horizontal”.
flip_label_map (Dict[int, int] | None) – Transform the label of the flipped image with the specific label. Default: None.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
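A minimal sketch of the flip itself, assuming each frame is a nested list of rows. `flip_frame` and `maybe_flip` are hypothetical helpers, and the label remapping only illustrates what flip_label_map is described as doing.

```python
import random


def flip_frame(frame, direction="horizontal"):
    """Reverse element order along one axis; the frame shape is unchanged."""
    if direction == "horizontal":
        return [row[::-1] for row in frame]  # reverse each row (width axis)
    if direction == "vertical":
        return frame[::-1]                   # reverse row order (height axis)
    raise ValueError(f"unsupported direction: {direction}")


def maybe_flip(frames, label, flip_ratio=0.5, direction="horizontal",
               flip_label_map=None, rng=random):
    """Flip all frames of a clip together with probability `flip_ratio`."""
    if rng.random() >= flip_ratio:
        return frames, label, False
    flipped = [flip_frame(f, direction) for f in frames]
    if flip_label_map is not None:
        # e.g. swap a "left" class with its "right" counterpart
        label = flip_label_map.get(label, label)
    return flipped, label, True


frame = [[1, 2, 3], [4, 5, 6]]
flipped_frames, new_label, was_flipped = maybe_flip(
    [frame], label=0, flip_ratio=1.0, flip_label_map={0: 1, 1: 0})
```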
-
class
mmaction.datasets.pipelines.FormatAudioShape(input_format)[source]¶ Format final audio shape to the given input_format.
Required keys are “imgs”, “num_clips” and “clip_len”, added or modified keys are “imgs” and “input_shape”.
- Parameters
input_format (str) – Define the final imgs format.
-
class
mmaction.datasets.pipelines.FormatShape(input_format, collapse=False)[source]¶ Format final imgs shape to the given input_format.
Required keys are “imgs”, “num_clips” and “clip_len”, added or modified keys are “imgs” and “input_shape”.
- Parameters
input_format (str) – Define the final imgs format.
collapse (bool) – Whether to collapse input_format N… to … (NCTHW to CTHW, etc.) if N is 1. Should be set to True when training and testing detectors. Default: False.
-
class
mmaction.datasets.pipelines.FrameSelector(*args, **kwargs)[source]¶ Deprecated class for
RawFrameDecode.
-
class
mmaction.datasets.pipelines.Fuse[source]¶ Fuse lazy operations.
- Fusion order:
crop -> resize -> flip
Required keys are “imgs”, “img_shape” and “lazy”, added or modified keys are “imgs”, “lazy”. Required keys in “lazy” are “crop_bbox”, “interpolation”, “flip_direction”.
-
class
mmaction.datasets.pipelines.GenerateLocalizationLabels[source]¶ Load video label for localizer with given video_name list.
Required keys are “duration_frame”, “duration_second”, “feature_frame”, “annotations”, added or modified keys are “gt_bbox”.
-
class
mmaction.datasets.pipelines.ImageDecode(io_backend='disk', decoding_backend='cv2', **kwargs)[source]¶ Load and decode images.
Required key is “filename”, added or modified keys are “imgs”, “img_shape” and “original_shape”.
- Parameters
io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
decoding_backend (str) – Backend used for image decoding. Default: ‘cv2’.
kwargs (dict, optional) – Arguments for FileClient.
-
class
mmaction.datasets.pipelines.ImageToTensor(keys)[source]¶ Convert image type to torch.Tensor type.
- Parameters
keys (Sequence[str]) – Required keys to be converted.
-
class
mmaction.datasets.pipelines.Imgaug(transforms)[source]¶ Imgaug augmentation.
Adds custom transformations from imgaug library. Please visit https://imgaug.readthedocs.io/en/latest/index.html to get more information. Two demo configs could be found in tsn and i3d config folder.
It’s better to use uint8 images as inputs since imgaug works best with numpy dtype uint8 and isn’t well tested with other dtypes. It should be noted that not all of the augmenters have the same input and output dtype, which may cause unexpected results.
Required keys are “imgs”, “img_shape”(if “gt_bboxes” is not None) and “modality”, added or modified keys are “imgs”, “img_shape”, “gt_bboxes” and “proposals”.
It is worth mentioning that Imgaug will NOT create custom keys like “interpolation”, “crop_bbox”, “flip_direction”, etc. So when using Imgaug along with other mmaction2 pipelines, we should pay more attention to required keys.
Two steps to use the Imgaug pipeline:
1. Create the initialization parameter transforms. There are three ways to create transforms.
1) string: only support default for now.
e.g. transforms='default'
2) list[dict]: create a list of augmenters by a list of dicts, each dict corresponds to one augmenter. Every dict MUST contain a key named type. type should be a string (iaa.Augmenter’s name) or an iaa.Augmenter subclass.
e.g. transforms=[dict(type='Rotate', rotate=(-20, 20))]
e.g. transforms=[dict(type=iaa.Rotate, rotate=(-20, 20))]
3) iaa.Augmenter: create an imgaug.Augmenter object.
e.g. transforms=iaa.Rotate(rotate=(-20, 20))
2. Add Imgaug in dataset pipeline. It is recommended to insert imgaug pipeline before Normalize. A demo pipeline is listed as follows.
```
pipeline = [
    dict(
        type='SampleFrames',
        clip_len=1,
        frame_interval=1,
        num_clips=16,
    ),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(
        type='MultiScaleCrop',
        input_size=224,
        scales=(1, 0.875, 0.75, 0.66),
        random_crop=False,
        max_wh_scale_gap=1,
        num_fixed_crops=13),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Imgaug', transforms='default'),
    # dict(type='Imgaug', transforms=[
    #     dict(type='Rotate', rotate=(-20, 20))
    # ]),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]
```
- Parameters
transforms (str | list[dict] | iaa.Augmenter) – Three different ways to create imgaug augmenter.
-
default_transforms()[source]¶ Default transforms for imgaug.
Implement RandAugment by imgaug. Please visit https://arxiv.org/abs/1909.13719 for more information.
Augmenters and hyper parameters are borrowed from the following repo: https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py
One augmenter, SolarizeAdd, is missing since imgaug doesn’t support it.
- Returns
The constructed RandAugment transforms.
- Return type
dict
-
imgaug_builder(cfg)[source]¶ Import a module from imgaug.
It follows the logic of build_from_cfg(). Use a dict object to create an iaa.Augmenter object.
- Parameters
cfg (dict) – Config dict. It should at least contain the key “type”.
- Returns
iaa.Augmenter: The constructed imgaug augmenter.
- Return type
obj
-
class
mmaction.datasets.pipelines.LoadAudioFeature(pad_method='zero')[source]¶ Load offline extracted audio features.
Required keys are “audio_path”, added or modified keys are “length” and “audios”.
-
class
mmaction.datasets.pipelines.LoadHVULabel(**kwargs)[source]¶ Convert the HVU label from dictionaries to torch tensors.
Required keys are “label”, “categories”, “category_nums”, added or modified keys are “label”, “mask” and “category_mask”.
-
class
mmaction.datasets.pipelines.LoadLocalizationFeature(raw_feature_ext='.csv')[source]¶ Load Video features for localizer with given video_name list.
Required keys are “video_name” and “data_prefix”, added or modified keys are “raw_feature”.
- Parameters
raw_feature_ext (str) – Raw feature file extension. Default: ‘.csv’.
-
class
mmaction.datasets.pipelines.LoadProposals(top_k, pgm_proposals_dir, pgm_features_dir, proposal_ext='.csv', feature_ext='.npy')[source]¶ Loading proposals with given proposal results.
Required keys are “video_name”, added or modified keys are ‘bsp_feature’, ‘tmin’, ‘tmax’, ‘tmin_score’, ‘tmax_score’ and ‘reference_temporal_iou’.
- Parameters
top_k (int) – The top k proposals to be loaded.
pgm_proposals_dir (str) – Directory to load proposals.
pgm_features_dir (str) – Directory to load proposal features.
proposal_ext (str) – Proposal file extension. Default: ‘.csv’.
feature_ext (str) – Feature file extension. Default: ‘.npy’.
-
class
mmaction.datasets.pipelines.MelSpectrogram(window_size=32, step_size=16, n_mels=80, fixed_length=128)[source]¶ MelSpectrogram. Transfer an audio wave into a melspectrogram figure.
Required keys are “audios”, “sample_rate”, “num_clips”, added or modified keys are “audios”.
- Parameters
window_size (int) – The window size in milliseconds. Default: 32.
step_size (int) – The step size in milliseconds. Default: 16.
n_mels (int) – Number of mels. Default: 80.
fixed_length (int) – The sample length of the melspectrogram may not be exactly as wished due to different fps; fix the length for batch collation by truncating or padding. Default: 128.
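The arithmetic behind these parameters can be sketched as follows. `ms_to_samples` and `fix_length` are hypothetical helpers. The 16000 Hz sample rate used in the example and zero padding are assumptions, not values taken from the library.

```python
def ms_to_samples(duration_ms, sample_rate):
    """Convert a duration in milliseconds to a number of audio samples."""
    return int(round(sample_rate * duration_ms / 1000))


def fix_length(frames, fixed_length, pad_value=0.0):
    """Truncate or pad a list of mel frames to exactly `fixed_length` frames."""
    if len(frames) >= fixed_length:
        return frames[:fixed_length]
    n_mels = len(frames[0]) if frames else 0
    padding = [[pad_value] * n_mels
               for _ in range(fixed_length - len(frames))]
    return frames + padding


# Assumed sample rate of 16 kHz, for illustration only.
window_samples = ms_to_samples(32, sample_rate=16000)
step_samples = ms_to_samples(16, sample_rate=16000)
padded = fix_length([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], fixed_length=5)
truncated = fix_length([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], fixed_length=2)
```

A fixed number of frames per sample is what allows spectrograms from clips of different duration to be stacked into one batch.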
-
class
mmaction.datasets.pipelines.MultiGroupCrop(crop_size, groups)[source]¶ Randomly crop the images into several groups.
Crop the random region with the same given crop_size and bounding box into several groups. Required keys are “imgs”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.
- Parameters
crop_size (int | tuple[int]) – (w, h) of crop size.
groups (int) – Number of groups.
-
class
mmaction.datasets.pipelines.MultiScaleCrop(input_size, scales=(1), max_wh_scale_gap=1, random_crop=False, num_fixed_crops=5, lazy=False)[source]¶ Crop images with a list of randomly selected scales.
Randomly select the w and h scales from a list of scales. Scale of 1 means the base size, which is the minimum of image width and height. The scale level of w and h is controlled to be smaller than a certain value to prevent too large or small aspect ratio. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox”, “img_shape”, “lazy” and “scales”. Required keys in “lazy” are “crop_bbox”, added or modified key is “crop_bbox”.
- Parameters
input_size (int | tuple[int]) – (w, h) of network input.
scales (tuple[float]) – width and height scales to be selected.
max_wh_scale_gap (int) – Maximum gap of w and h scale levels. Default: 1.
random_crop (bool) – If set to True, the cropping bbox will be randomly sampled, otherwise it will be sampled from fixed regions. Default: False.
num_fixed_crops (int) – If set to 5, the cropping bbox will keep 5 basic fixed regions: “upper left”, “upper right”, “lower left”, “lower right”, “center”. If set to 13, the cropping bbox will append another 8 fix regions: “center left”, “center right”, “lower center”, “upper center”, “upper left quarter”, “upper right quarter”, “lower left quarter”, “lower right quarter”. Default: 5.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
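To make the role of scales and max_wh_scale_gap concrete, the sketch below enumerates candidate (w, h) crop sizes. `candidate_crop_sizes` is a hypothetical helper. It assumes the gap is measured between the positions of the w and h scales in the scales tuple, and it ignores any snapping of sizes to input_size.

```python
def candidate_crop_sizes(img_w, img_h, scales, max_wh_scale_gap=1):
    """Enumerate (w, h) crop sizes whose scale levels are close enough."""
    base_size = min(img_w, img_h)            # a scale of 1 maps to this size
    sizes = [int(base_size * s) for s in scales]
    pairs = []
    for i, crop_h in enumerate(sizes):
        for j, crop_w in enumerate(sizes):
            # Reject pairs whose scale levels differ too much; this bounds
            # the aspect ratio of the crop.
            if abs(i - j) <= max_wh_scale_gap:
                pairs.append((crop_w, crop_h))
    return pairs


pairs_gap1 = candidate_crop_sizes(340, 256, scales=(1, 0.875, 0.75, 0.66))
pairs_gap0 = candidate_crop_sizes(340, 256, scales=(1, 0.875, 0.75, 0.66),
                                  max_wh_scale_gap=0)
```

With a gap of 0 only square crops remain; larger gaps admit more elongated crops.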
-
class
mmaction.datasets.pipelines.Normalize(mean, std, to_bgr=False, adjust_magnitude=False)[source]¶ Normalize images with the given mean and std value.
Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs” and “img_norm_cfg”. If modality is ‘Flow’, the additional key “scale_factor” is required.
- Parameters
mean (Sequence[float]) – Mean values of different channels.
std (Sequence[float]) – Std values of different channels.
to_bgr (bool) – Whether to convert channels from RGB to BGR. Default: False.
adjust_magnitude (bool) – Indicate whether to adjust the flow magnitude on ‘scale_factor’ when modality is ‘Flow’. Default: False.
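Per-channel normalization amounts to (value - mean) / std. A minimal sketch for a single pixel is given below. `normalize_pixel` is a hypothetical helper, and applying the channel swap before the mean and std is an assumption about ordering.

```python
def normalize_pixel(pixel, mean, std, to_bgr=False):
    """Normalize one pixel given per-channel mean and std."""
    if to_bgr:
        pixel = pixel[::-1]  # assumption: swap channels before normalizing
    return [(v - m) / s for v, m, s in zip(pixel, mean, std)]


mean = [123.675, 116.28, 103.53]
std = [58.395, 57.12, 57.375]
centered = normalize_pixel([123.675, 116.28, 103.53], mean, std)
swapped = normalize_pixel([10.0, 20.0, 30.0], [0.0, 0.0, 0.0],
                          [10.0, 10.0, 10.0], to_bgr=True)
```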
-
class
mmaction.datasets.pipelines.OpenCVDecode[source]¶ Using OpenCV to decode the video.
Required keys are “video_reader”, “filename” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.
-
class
mmaction.datasets.pipelines.OpenCVInit(io_backend='disk', **kwargs)[source]¶ Using OpenCV to initialize the video_reader.
Required keys are “filename”, added or modified keys are “new_path”, “video_reader” and “total_frames”.
-
class
mmaction.datasets.pipelines.PyAVDecode(multi_thread=False)[source]¶ Using pyav to decode the video.
PyAV: https://github.com/mikeboers/PyAV
Required keys are “video_reader” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.
- Parameters
multi_thread (bool) – If set to True, it will apply multi thread processing. Default: False.
-
class
mmaction.datasets.pipelines.PyAVDecodeMotionVector(multi_thread=False)[source]¶ Using pyav to decode the motion vectors from video.
Reference: https://github.com/PyAV-Org/PyAV/blob/main/tests/test_decode.py
Required keys are “video_reader” and “frame_inds”, added or modified keys are “motion_vectors”, “frame_inds”.
- Parameters
multi_thread (bool) – If set to True, it will apply multi thread processing. Default: False.
-
class
mmaction.datasets.pipelines.PyAVInit(io_backend='disk', **kwargs)[source]¶ Using pyav to initialize the video.
PyAV: https://github.com/mikeboers/PyAV
Required keys are “filename”, added or modified keys are “video_reader”, and “total_frames”.
- Parameters
io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
kwargs (dict) – Args for file client.
-
class
mmaction.datasets.pipelines.RandomCrop(size, lazy=False)[source]¶ Vanilla square random crop that specifies the output size.
Required keys in results are “imgs” and “img_shape”, added or modified keys are “imgs”, “lazy”; Required keys in “lazy” are “flip”, “crop_bbox”, added or modified key is “crop_bbox”.
- Parameters
size (int) – The output size of the images.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
-
class
mmaction.datasets.pipelines.RandomRescale(scale_range, interpolation='bilinear')[source]¶ Randomly resize images so that the short_edge is resized to a specific size in a given range. The scale ratio is unchanged after resizing.
Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “resize_size”, “short_edge”.
- Parameters
scale_range (tuple[int]) – The range of short edge length. A closed interval.
interpolation (str) – Algorithm used for interpolation: “nearest” | “bilinear”. Default: “bilinear”.
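A sketch of the size computation, assuming the short edge is drawn uniformly from the closed interval and both edges are scaled by the same factor. `random_rescale_size` is a hypothetical helper, and rounding to the nearest integer is an assumption.

```python
import random


def random_rescale_size(img_w, img_h, scale_range, rng=random):
    """Pick a short-edge length and return the rescaled (w, h, short_edge)."""
    # randint includes both endpoints, matching a closed interval.
    short_edge = rng.randint(scale_range[0], scale_range[1])
    factor = short_edge / min(img_w, img_h)
    new_w = int(img_w * factor + 0.5)
    new_h = int(img_h * factor + 0.5)
    return new_w, new_h, short_edge


fixed = random_rescale_size(640, 480, scale_range=(256, 256))
sampled = random_rescale_size(640, 480, scale_range=(256, 320),
                              rng=random.Random(0))
```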
-
class
mmaction.datasets.pipelines.RandomResizedCrop(area_range=(0.08, 1.0), aspect_ratio_range=(0.75, 1.3333333333333333), lazy=False)[source]¶ Random crop that specifies the area and height-width ratio range.
Required keys in results are “imgs”, “img_shape”, “crop_bbox” and “lazy”, added or modified keys are “imgs”, “crop_bbox” and “lazy”; Required keys in “lazy” are “flip”, “crop_bbox”, added or modified key is “crop_bbox”.
- Parameters
area_range (Tuple[float]) – The candidate area scales range of output cropped images. Default: (0.08, 1.0).
aspect_ratio_range (Tuple[float]) – The candidate aspect ratio range of output cropped images. Default: (3 / 4, 4 / 3).
lazy (bool) – Determine whether to apply lazy operation. Default: False.
-
static
get_crop_bbox(img_shape, area_range, aspect_ratio_range, max_attempts=10)[source]¶ Get a crop bbox given the area range and aspect ratio range.
- Parameters
img_shape (Tuple[int]) – Image shape
area_range (Tuple[float]) – The candidate area scales range of output cropped images. Default: (0.08, 1.0).
aspect_ratio_range (Tuple[float]) – The candidate aspect ratio range of output cropped images. Default: (3 / 4, 4 / 3).
max_attempts (int) – Maximum number of attempts to generate a random candidate bounding box. If no qualified one is found, the center bounding box will be used. Default: 10.
- Returns
(list[int]) A random crop bbox within the area range and aspect ratio range.
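The sampling procedure described above can be sketched in pure Python. `sample_crop_bbox` is a hypothetical helper. The (h, w) order of img_shape, the log-uniform draw of the aspect ratio, and the square center fallback are assumptions.

```python
import math
import random


def sample_crop_bbox(img_shape, area_range=(0.08, 1.0),
                     aspect_ratio_range=(3 / 4, 4 / 3),
                     max_attempts=10, rng=random):
    """Return [x1, y1, x2, y2] of a random crop, or a center crop on failure."""
    img_h, img_w = img_shape  # assumption: shape is given as (h, w)
    area = img_h * img_w
    log_min = math.log(aspect_ratio_range[0])
    log_max = math.log(aspect_ratio_range[1])
    for _ in range(max_attempts):
        # Draw the aspect ratio log-uniformly so that r and 1/r are
        # equally likely, and the area fraction uniformly.
        aspect_ratio = math.exp(rng.uniform(log_min, log_max))
        target_area = rng.uniform(*area_range) * area
        crop_w = int(round(math.sqrt(target_area * aspect_ratio)))
        crop_h = int(round(math.sqrt(target_area / aspect_ratio)))
        if 0 < crop_w <= img_w and 0 < crop_h <= img_h:
            x = rng.randint(0, img_w - crop_w)
            y = rng.randint(0, img_h - crop_h)
            return [x, y, x + crop_w, y + crop_h]
    # No candidate fitted inside the image: fall back to a center crop.
    side = min(img_h, img_w)
    x = (img_w - side) // 2
    y = (img_h - side) // 2
    return [x, y, x + side, y + side]


bbox = sample_crop_bbox((240, 320), rng=random.Random(0))
fallback = sample_crop_bbox((240, 320), max_attempts=0)
```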
-
class
mmaction.datasets.pipelines.RandomScale(scales, mode='range', **kwargs)[source]¶ Resize images by a random scale.
Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “lazy”, “scale”, “resize_size”. No keys are required in “lazy”; added or modified key is “interpolation”.
- Parameters
scales (tuple[int]) – Tuple of scales to be chosen for resize.
mode (str) – Selection mode for choosing the scale. Options are “range” and “value”. If set to “range”, the short edge will be randomly chosen from the range of minimum and maximum on the shorter one in all tuples. Otherwise, the longer edge will be randomly chosen from the range of minimum and maximum on the longer one in all tuples. Default: ‘range’.
-
class
mmaction.datasets.pipelines.RawFrameDecode(io_backend='disk', decoding_backend='cv2', **kwargs)[source]¶ Load and decode frames with given indices.
Required keys are “frame_dir”, “filename_tmpl” and “frame_inds”, added or modified keys are “imgs”, “img_shape” and “original_shape”.
- Parameters
io_backend (str) – IO backend where frames are stored. Default: ‘disk’.
decoding_backend (str) – Backend used for image decoding. Default: ‘cv2’.
kwargs (dict, optional) – Arguments for FileClient.
-
class
mmaction.datasets.pipelines.Rename(mapping)[source]¶ Rename the key in results.
- Parameters
mapping (dict) – The keys in results that need to be renamed. The key of the dict is the original name, while the value is the new name. If the original name not found in results, do nothing. Default: dict().
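The described behaviour fits in a few lines. `rename_keys` below is a hypothetical stand-alone function used only to illustrate it.

```python
def rename_keys(results, mapping):
    """Rename keys of `results` in place according to `mapping`."""
    for old_name, new_name in mapping.items():
        if old_name in results:  # missing keys are silently skipped
            results[new_name] = results.pop(old_name)
    return results


renamed = rename_keys({"imgs": [1, 2], "label": 3},
                      {"imgs": "frames", "missing": "unused"})
```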
-
class
mmaction.datasets.pipelines.Resize(scale, keep_ratio=True, interpolation='bilinear', lazy=False)[source]¶ Resize images to a specific size.
Required keys are “imgs”, “img_shape”, “modality”, added or modified keys are “imgs”, “img_shape”, “keep_ratio”, “scale_factor”, “lazy”, “resize_size”. No keys are required in “lazy”; added or modified key is “interpolation”.
- Parameters
scale (float | Tuple[int]) – If keep_ratio is True, it serves as scaling factor or maximum size: If it is a float number, the image will be rescaled by this factor, else if it is a tuple of 2 integers, the image will be rescaled as large as possible within the scale. Otherwise, it serves as (w, h) of output size.
keep_ratio (bool) – If set to True, Images will be resized without changing the aspect ratio. Otherwise, it will resize images to a given size. Default: True.
interpolation (str) – Algorithm used for interpolation: “nearest” | “bilinear”. Default: “bilinear”.
lazy (bool) – Determine whether to apply lazy operation. Default: False.
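The two meanings of scale can be sketched as follows. `rescale_size` and `resolve_output_size` are hypothetical helpers. Treating -1 in a tuple as "unbounded" is an assumption made to cover configs such as scale=(-1, 256), and rounding to the nearest integer is also an assumption.

```python
import math


def rescale_size(img_w, img_h, scale):
    """Compute the output (w, h) when the aspect ratio is kept."""
    if isinstance(scale, (int, float)):
        factor = float(scale)                 # plain scaling factor
    else:
        # Assumption: -1 means "no limit" on that edge.
        limits = [math.inf if s == -1 else s for s in scale]
        long_limit, short_limit = max(limits), min(limits)
        # Largest factor that keeps both edges within their limits.
        factor = min(long_limit / max(img_w, img_h),
                     short_limit / min(img_w, img_h))
    return int(img_w * factor + 0.5), int(img_h * factor + 0.5)


def resolve_output_size(img_w, img_h, scale, keep_ratio=True):
    """Return the output (w, h) for either value of keep_ratio."""
    if keep_ratio:
        return rescale_size(img_w, img_h, scale)
    return tuple(scale)                       # scale is the exact (w, h)


by_factor = resolve_output_size(640, 480, 0.5)
within_box = resolve_output_size(640, 480, (320, 320))
short_edge_256 = resolve_output_size(640, 480, (-1, 256))
exact = resolve_output_size(640, 480, (224, 224), keep_ratio=False)
```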
-
class
mmaction.datasets.pipelines.SampleAVAFrames(clip_len, frame_interval=2, test_mode=False)[source]¶
-
class
mmaction.datasets.pipelines.SampleFrames(clip_len, frame_interval=1, num_clips=1, temporal_jitter=False, twice_sample=False, out_of_bound_opt='loop', test_mode=False, start_index=None)[source]¶ Sample frames from the video.
Required keys are “filename”, “total_frames”, “start_index” , added or modified keys are “frame_inds”, “frame_interval” and “num_clips”.
- Parameters
clip_len (int) – Frames of each sampled output clip.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.
num_clips (int) – Number of clips to be sampled. Default: 1.
temporal_jitter (bool) – Whether to apply temporal jittering. Default: False.
twice_sample (bool) – Whether to use twice sample when testing. If set to True, it will sample frames with and without fixed shift, which is commonly used for testing in TSM model. Default: False.
out_of_bound_opt (str) – The way to deal with out of bounds frame indexes. Available options are ‘loop’, ‘repeat_last’. Default: ‘loop’.
test_mode (bool) – Store True when building test or validation dataset. Default: False.
start_index (None) – This argument is deprecated and moved to dataset class (BaseDataset, VideoDataset, RawframeDataset, etc), see this: https://github.com/open-mmlab/mmaction2/pull/89.
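The two out_of_bound_opt choices can be illustrated for a single clip. `fix_out_of_bound` is a hypothetical helper, and falling back to index 0 when every index is out of range is an assumption.

```python
def fix_out_of_bound(frame_inds, total_frames, out_of_bound_opt="loop"):
    """Map out-of-range frame indices of one clip back into the video."""
    if out_of_bound_opt == "loop":
        # Wrap around to the beginning of the video.
        return [i % total_frames for i in frame_inds]
    if out_of_bound_opt == "repeat_last":
        # Replace every out-of-range index with the last in-range one.
        safe = [i for i in frame_inds if i < total_frames]
        last = max(safe) if safe else 0
        return [i if i < total_frames else last for i in frame_inds]
    raise ValueError(f"illegal out_of_bound option: {out_of_bound_opt}")


looped = fix_out_of_bound([6, 8, 10, 12], total_frames=10)
repeated = fix_out_of_bound([6, 8, 10, 12], total_frames=10,
                            out_of_bound_opt="repeat_last")
```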
-
class
mmaction.datasets.pipelines.SampleProposalFrames(clip_len, body_segments, aug_segments, aug_ratio, frame_interval=1, test_interval=6, temporal_jitter=False, mode='train')[source]¶ Sample frames from proposals in the video.
Required keys are “total_frames” and “out_proposals”, added or modified keys are “frame_inds”, “frame_interval”, “num_clips”, ‘clip_len’ and ‘num_proposals’.
- Parameters
clip_len (int) – Frames of each sampled output clip.
body_segments (int) – Number of segments in course period.
aug_segments (list[int]) – Number of segments in starting and ending period.
aug_ratio (int | float | tuple[int | float]) – The ratio of the length of augmentation to that of the proposal.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 1.
test_interval (int) – Temporal interval of adjacent sampled frames in test mode. Default: 6.
temporal_jitter (bool) – Whether to apply temporal jittering. Default: False.
mode (str) – Choose ‘train’, ‘val’ or ‘test’ mode. Default: ‘train’.
-
class
mmaction.datasets.pipelines.TenCrop(crop_size)[source]¶ Crop the images into 10 crops (corner + center + flip).
Crop the four corners and the center part of the image with the same given crop_size, and flip it horizontally. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.
- Parameters
crop_size (int | tuple[int]) – (w, h) of crop size.
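A sketch of the resulting crop plan: five boxes, each used once as-is and once flipped horizontally. `ten_crop_plan` is a hypothetical helper, and the order of the ten entries is an assumption.

```python
def ten_crop_plan(img_w, img_h, crop_w, crop_h):
    """Return ten (bbox, flipped) entries; bbox is (x1, y1, x2, y2)."""
    w_max = img_w - crop_w
    h_max = img_h - crop_h
    offsets = [
        (0, 0),                     # upper left
        (w_max, 0),                 # upper right
        (0, h_max),                 # lower left
        (w_max, h_max),             # lower right
        (w_max // 2, h_max // 2),   # center
    ]
    plan = []
    for x, y in offsets:
        bbox = (x, y, x + crop_w, y + crop_h)
        plan.append((bbox, False))  # the crop itself
        plan.append((bbox, True))   # its horizontal flip
    return plan


plan = ten_crop_plan(340, 256, 224, 224)
```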
-
class
mmaction.datasets.pipelines.ThreeCrop(crop_size)[source]¶ Crop images into three crops.
Crop the images equally into three crops with equal intervals along the shorter side. Required keys are “imgs”, “img_shape”, added or modified keys are “imgs”, “crop_bbox” and “img_shape”.
- Parameters
crop_size (int | tuple[int]) – (w, h) of crop size.
-
class
mmaction.datasets.pipelines.ToDataContainer(fields)[source]¶ Convert the data to DataContainer.
- Parameters
fields (Sequence[dict]) – Required fields to be converted with keys and attributes. E.g. fields=(dict(key=’gt_bbox’, stack=False),). Note that key can also be a list of keys, if so, every tensor in the list will be converted to DataContainer.
-
class
mmaction.datasets.pipelines.ToTensor(keys)[source]¶ Convert some values in results dict to torch.Tensor type in data loader pipeline.
- Parameters
keys (Sequence[str]) – Required keys to be converted.
-
class
mmaction.datasets.pipelines.Transpose(keys, order)[source]¶ Transpose image channels to a given order.
- Parameters
keys (Sequence[str]) – Required keys to be converted.
order (Sequence[int]) – Image channel order.
-
class
mmaction.datasets.pipelines.UntrimmedSampleFrames(clip_len=1, frame_interval=16, start_index=None)[source]¶ Sample frames from the untrimmed video.
Required keys are “filename”, “total_frames”, added or modified keys are “frame_inds”, “frame_interval” and “num_clips”.
- Parameters
clip_len (int) – The length of sampled clips. Default: 1.
frame_interval (int) – Temporal interval of adjacent sampled frames. Default: 16.
start_index (None) – This argument is deprecated and moved to dataset class (BaseDataset, VideoDataset, RawframeDataset, etc), see this: https://github.com/open-mmlab/mmaction2/pull/89.
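One plausible reading of this sampler is that a clip is taken around a center every frame_interval frames. The sketch below follows that reading. `untrimmed_frame_inds` is a hypothetical helper, and both the placement of the centers and the clamping at the video boundaries are assumptions.

```python
def untrimmed_frame_inds(total_frames, clip_len=1, frame_interval=16):
    """Return one list of frame indices per sampled clip (sketch)."""
    # Assumption: clip centers start at half an interval and then advance
    # by one interval until the end of the video.
    centers = range(frame_interval // 2, total_frames, frame_interval)
    half = clip_len // 2
    clips = []
    for center in centers:
        clip = [min(max(center + offset, 0), total_frames - 1)
                for offset in range(-half, clip_len - half)]
        clips.append(clip)
    return clips


single_frame_clips = untrimmed_frame_inds(total_frames=40)
three_frame_clips = untrimmed_frame_inds(total_frames=40, clip_len=3)
```

Under this reading the number of clips grows with the video length, unlike SampleFrames where num_clips is fixed.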
samplers¶
-
class
mmaction.datasets.samplers.DistributedPowerSampler(dataset, num_replicas=None, rank=None, power=1, seed=0)[source]¶ DistributedPowerSampler inheriting from torch.utils.data.DistributedSampler.
Samples are sampled with the probability that is proportional to the power of label frequency (freq ^ power). The sampler only applies to single class recognition dataset.
The default value of power is 1, which is equivalent to bootstrap sampling from the entire dataset.
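The effect of power on the class distribution can be sketched without any distributed machinery. `class_sampling_probs` is a hypothetical helper that only computes the normalized freq ^ power weights.

```python
from collections import Counter


def class_sampling_probs(labels, power=1.0):
    """Return the probability of drawing each class, proportional to freq ** power."""
    freq = Counter(labels)
    weights = {cls: count ** power for cls, count in freq.items()}
    total = sum(weights.values())
    return {cls: weight / total for cls, weight in weights.items()}


labels = [0, 0, 0, 1]
probs_power1 = class_sampling_probs(labels, power=1.0)  # follows frequency
probs_power0 = class_sampling_probs(labels, power=0.0)  # class-balanced
```

Values of power between 0 and 1 interpolate between class-balanced sampling and sampling that follows the label frequency.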
-
class
mmaction.datasets.samplers.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0)[source]¶ DistributedSampler inheriting from torch.utils.data.DistributedSampler.
In lower versions of PyTorch, there is no shuffle argument. This child class will port one to DistributedSampler.
mmaction.utils¶
-
class
mmaction.utils.GradCAM(model, target_layer_name, colormap='viridis')[source]¶ GradCAM class helps create visualization results.
Visualization results are blended by heatmaps and input images. This class is modified from https://github.com/facebookresearch/SlowFast/blob/master/slowfast/visualization/gradcam_utils.py
For more information about GradCAM, please visit: https://arxiv.org/pdf/1610.02391.pdf
-
class
mmaction.utils.PreciseBNHook(dataloader, num_iters=200, interval=1)[source]¶ Precise BN hook.
-
dataloader¶ A PyTorch dataloader.
- Type
DataLoader
-
num_iters¶ Number of iterations to update the bn stats. Default: 200.
- Type
int
-
interval¶ Interval (in epochs) at which precise BN is performed. Default: 1.
- Type
int
-
-
mmaction.utils.get_random_string(length=15)[source]¶ Get random string with letters and digits.
- Parameters
length (int) – Length of random string. Default: 15.
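An equivalent helper can be written with the standard library. `random_string` below is a hypothetical re-implementation for illustration and may differ from the library function in how it draws characters.

```python
import random
import string


def random_string(length=15):
    """Return a random string made of ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=length))


token = random_string()
short_token = random_string(4)
```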
-
mmaction.utils.get_root_logger(log_file=None, log_level=20)[source]¶ Use the get_logger method in mmcv to get the root logger.
The logger will be initialized if it has not been initialized. By default a StreamHandler will be added. If log_file is specified, a FileHandler will also be added. The name of the root logger is the top-level package name, e.g., “mmaction”.
- Parameters
log_file (str | None) – The log filename. If specified, a FileHandler will be added to the root logger.
log_level (int) – The root logger level. Note that only the process of rank 0 is affected, while other processes will set the level to “Error” and be silent most of the time.
- Returns
The root logger.
- Return type
logging.Logger
mmaction.localization¶
-
mmaction.localization.eval_ap(detections, gt_by_cls, iou_range)[source]¶ Evaluate average precisions.
- Parameters
detections (dict) – Results of detections.
gt_by_cls (dict) – Information of groundtruth.
iou_range (list) – Ranges of iou.
- Returns
Average precision values of classes at ious.
- Return type
list
-
mmaction.localization.generate_bsp_feature(video_list, video_infos, tem_results_dir, pgm_proposals_dir, top_k=1000, bsp_boundary_ratio=0.2, num_sample_start=8, num_sample_end=8, num_sample_action=16, num_sample_interp=3, tem_results_ext='.csv', pgm_proposal_ext='.csv', result_dict=None)[source]¶ Generate Boundary-Sensitive Proposal Feature with given proposals.
- Parameters
video_list (list[int]) – List of video indices to generate bsp_feature.
video_infos (list[dict]) – List of video_info dict that contains ‘video_name’.
tem_results_dir (str) – Directory to load temporal evaluation results.
pgm_proposals_dir (str) – Directory to load proposals.
top_k (int) – Number of proposals to be considered. Default: 1000
bsp_boundary_ratio (float) – Ratio for proposal boundary (start/end). Default: 0.2.
num_sample_start (int) – Num of samples for actionness in start region. Default: 8.
num_sample_end (int) – Num of samples for actionness in end region. Default: 8.
num_sample_action (int) – Num of samples for actionness in center region. Default: 16.
num_sample_interp (int) – Num of samples for interpolation for each sample point. Default: 3.
tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.
pgm_proposal_ext (str) – File extension for proposals. Default: ‘.csv’.
result_dict (dict | None) – The dict to save the results. Default: None.
- Returns
A dict containing video_name as keys and bsp_feature as value. If result_dict is not None, save the results to it.
- Return type
bsp_feature_dict (dict)
-
mmaction.localization.generate_candidate_proposals(video_list, video_infos, tem_results_dir, temporal_scale, peak_threshold, tem_results_ext='.csv', result_dict=None)[source]¶ Generate Candidate Proposals with given temporal evaluation results. Each proposal file will contain: ‘tmin,tmax,tmin_score,tmax_score,score,match_iou,match_ioa’.
- Parameters
video_list (list[int]) – List of video indices to generate proposals.
video_infos (list[dict]) – List of video_info dict that contains ‘video_name’, ‘duration_frame’, ‘duration_second’, ‘feature_frame’, and ‘annotations’.
tem_results_dir (str) – Directory to load temporal evaluation results.
temporal_scale (int) – The number (scale) on temporal axis.
peak_threshold (float) – The threshold for proposal generation.
tem_results_ext (str) – File extension for temporal evaluation model output. Default: ‘.csv’.
result_dict (dict | None) – The dict to save the results. Default: None.
- Returns
A dict containing video_name as keys and proposal list as value. If result_dict is not None, save the results to it.
- Return type
dict
-
mmaction.localization.load_localize_proposal_file(filename)[source]¶ Load the proposal file and split it into many parts which contain one video’s information separately.
- Parameters
filename (str) – Path to the proposal file.
- Returns
List of all videos’ information.
- Return type
list
-
mmaction.localization.perform_regression(detections)[source]¶ Perform regression on detection results.
- Parameters
detections (list) – Detection results before regression.
- Returns
Detection results after regression.
- Return type
list
-
mmaction.localization.soft_nms(proposals, alpha, low_threshold, high_threshold, top_k)[source]¶ Soft NMS for temporal proposals.
- Parameters
proposals (np.ndarray) – Proposals generated by network.
alpha (float) – Alpha value of Gaussian decaying function.
low_threshold (float) – Low threshold for soft nms.
high_threshold (float) – High threshold for soft nms.
top_k (int) – Top k values to be considered.
- Returns
The updated proposals.
- Return type
np.ndarray
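A pure-Python sketch of Gaussian soft NMS on temporal proposals follows. `soft_nms_sketch` and `interval_iou` are hypothetical helpers. Proposals are represented as (tmin, tmax, score) tuples instead of an np.ndarray, and the width-dependent threshold and the stopping rule are assumptions.

```python
import math


def interval_iou(a_min, a_max, b_min, b_max):
    """IoU of two temporal intervals."""
    inter = max(0.0, min(a_max, b_max) - max(a_min, b_min))
    union = (a_max - a_min) + (b_max - b_min) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_sketch(proposals, alpha, low_threshold, high_threshold, top_k):
    """Keep up to `top_k` proposals, decaying scores of overlapping ones."""
    remaining = [list(p) for p in proposals]
    kept = []
    while remaining and len(kept) < top_k:
        # Select the proposal with the highest current score.
        best = max(range(len(remaining)), key=lambda i: remaining[i][2])
        b_min, b_max, b_score = remaining.pop(best)
        # Assumption: the overlap threshold grows with the proposal width.
        width = b_max - b_min
        threshold = low_threshold + (high_threshold - low_threshold) * width
        for other in remaining:
            iou = interval_iou(b_min, b_max, other[0], other[1])
            if iou > threshold:
                # Gaussian decay instead of discarding the proposal.
                other[2] *= math.exp(-iou * iou / alpha)
        kept.append((b_min, b_max, b_score))
    return kept


kept = soft_nms_sketch(
    [(0.0, 0.5, 0.9), (0.05, 0.5, 0.8), (0.6, 0.9, 0.7)],
    alpha=0.4, low_threshold=0.5, high_threshold=0.9, top_k=3)
```

Unlike hard NMS, a heavily overlapping proposal is kept with a reduced score, so it is only ranked lower rather than removed.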
-
mmaction.localization.temporal_iop(proposal_min, proposal_max, gt_min, gt_max)[source]¶ Compute IoP score between a groundtruth bbox and the proposals.
Compute the IoP which is defined as the overlap ratio with groundtruth proportional to the duration of this proposal.
- Parameters
proposal_min (list[float]) – List of temporal anchor min.
proposal_max (list[float]) – List of temporal anchor max.
gt_min (float) – Groundtruth temporal box min.
gt_max (float) – Groundtruth temporal box max.
- Returns
List of intersection over anchor scores.
- Return type
list[float]
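IoP divides the overlap by the proposal duration rather than by the union. A list-based sketch is shown below. `temporal_iop_sketch` is a hypothetical helper that assumes every proposal has positive length.

```python
def temporal_iop_sketch(proposal_min, proposal_max, gt_min, gt_max):
    """Intersection over proposal length, for each proposal."""
    scores = []
    for p_min, p_max in zip(proposal_min, proposal_max):
        inter = max(0.0, min(p_max, gt_max) - max(p_min, gt_min))
        scores.append(inter / (p_max - p_min))  # assumes p_max > p_min
    return scores


iop_scores = temporal_iop_sketch([0.0, 1.0], [4.0, 2.0], gt_min=1.0, gt_max=3.0)
```

A proposal that lies entirely inside the groundtruth box scores 1.0 regardless of how much of the groundtruth it covers.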
-
mmaction.localization.temporal_iou(proposal_min, proposal_max, gt_min, gt_max)[source]¶ Compute IoU score between a groundtruth bbox and the proposals.
- Parameters
proposal_min (list[float]) – List of temporal anchor min.
proposal_max (list[float]) – List of temporal anchor max.
gt_min (float) – Groundtruth temporal box min.
gt_max (float) – Groundtruth temporal box max.
- Returns
List of iou scores.
- Return type
list[float]
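For comparison with plain Python lists, temporal IoU divides the overlap by the union of the two intervals. `temporal_iou_sketch` is a hypothetical helper that assumes the union has positive length.

```python
def temporal_iou_sketch(proposal_min, proposal_max, gt_min, gt_max):
    """Intersection over union with one groundtruth box, for each proposal."""
    scores = []
    for p_min, p_max in zip(proposal_min, proposal_max):
        inter = max(0.0, min(p_max, gt_max) - max(p_min, gt_min))
        union = (p_max - p_min) + (gt_max - gt_min) - inter
        scores.append(inter / union)  # assumes union > 0
    return scores


iou_scores = temporal_iou_sketch([0.0, 2.0], [2.0, 4.0], gt_min=0.0, gt_max=2.0)
```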