- Fall detection
- Paper I: Elderly Fall Detection Systems: A Literature Survey, 2020
- Paper II: A survey on vision-based fall detection, 2015
- Paper III: 3D depth image analysis for indoor fall detection of elderly people, 2015
- Paper IV: Deep learning based systems developed for fall detection: a review, 2020
- Paper V: Implementation of Fall Detection System Based on 3D Skeleton for Deep Learning Technique, 2019
- Paper VI: Human fall-down event detection based on 2D skeletons and deep learning approach
- Human activity recognition
- Paper VII: Vision-based human activity recognition: a survey, 2020
- 3D and depth data
- 3D skeleton-based human representations
- Knowledge-based HAR
- Abnormal HAR
Fall detection
Paper I: 2020
Elderly Fall Detection Systems: A Literature Survey, 2020
Why?
Types of falls
- Types
- forward, lateral and backward in El-Bendary et al.
- forward, backward, left-side, right-side, blinded-forward and blinded-backward in Putra et al.
- fall lateral left and lie on the floor, fall lateral left and sit up from the floor, fall lateral right and lie on the floor, fall lateral right and sit up from the floor, fall forward and lie on the floor, and fall backward and lie on the floor in Chen et al.
- Elderly people may experience falls of longer duration, because their activities of daily living involve low-speed motion
- The characteristics behind different types of falls (e.g., the subject's age, gender) are not taken into consideration in most of the surveyed work on fall detection.
Previous work and falls
Paper | contents | summary |
---|---|---|
Chaudhuri et al. (2014) | fall detection devices for people of different ages (excluding children) from several perspectives, including background, objectives, data sources, eligibility criteria, and intervention methods. | most of the studies were based on synthetic data |
Zhang et al. (2015) | vision-based fall detection systems and their related benchmark data sets. Methods are divided into four categories, namely individual single RGB cameras, infrared cameras, depth cameras, and 3D-based methods using camera arrays. Methods are also divided to rely on the activity/inactivity of the subjects, shape (width-to-height ratio), and motion. | Non-vision sensors are not included |
Cai et al. (2017) | depth cameras, reviewed the benchmark data sets acquired by Microsoft Kinect and similar cameras. | helpful for looking for benchmark data sets |
Chen et al. (2017a) | vision- and non-vision-based systems | fusion of depth cameras and inertial sensor resulted in a system that is more robust |
Igual et al. (2013) | low-cost cameras and accelerometers embedded in smartphones may offer the most sensible technological choice for the investigation of fall detection. They also reported three main challenges: (i) real-world deployment performance, (ii) usability, and (iii) acceptance. | another option of sensors |
Goals
provide a literature survey of work conducted on elderly fall detection using sensor networks and IoT in terms of data acquisition, data analysis, data transport and storage, sensor networks and Internet of Things (IoT) platforms, as well as security and privacy.
How?
The components of system
Sensors
- Individual wearable sensors
- possible choices: accelerometers, gyroscopes, glucometers, pressure sensors, ECG (electrocardiography), EEG (electroencephalography), or EOG (electrooculography)
- Bourke et al. (2007) found that accelerometers are the most popular sensors for fall detection. Smartphones are more practical than dedicated wearable sensors.
- Individual vision sensors
- Possible choices: infrared, RGB, RGB-D etc.
- main challenge of vision-based detection is the potential violation of privacy
- Almost all of them use simulated fall datasets; the dataset by Boulard et al. (2014) contains actual fall data, and the one by Stone and Skubic (2015) contains mixed data.
- Individual ambient sensors
- Possible choices: active infrared, RFID, pressure sensors, smart tiles, magnetic switches, Doppler radar, ultrasonic sensors and microphones.
- Subjects
- simulated data from OpenSim contributed to an improvement in the performance of the resulting models
- transfer learning to adapt to subjects who were not represented in the training data
- reinforcement learning for different subjects
Sensor fusion
Types
feature fusion is the most popular approach, followed by decision fusion (a small sketch contrasting the two follows).
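A minimal sketch contrasting the two fusion strategies; the function names, the scikit-learn-style classifier and the 0.5 weights are illustrative assumptions, not details from the survey.

```python
import numpy as np

def feature_fusion(accel_feat, vision_feat, classifier):
    """Feature-level fusion: concatenate per-sensor feature vectors
    and feed the joint vector to a single classifier."""
    fused = np.concatenate([accel_feat, vision_feat])
    return classifier.predict(fused.reshape(1, -1))[0]

def decision_fusion(accel_score, vision_score, w_accel=0.5, w_vision=0.5):
    """Decision-level fusion: each sensor is classified separately and
    the per-sensor fall scores are combined (here, a weighted average)."""
    fused_score = w_accel * accel_score + w_vision * vision_score
    return fused_score > 0.5  # True -> report a fall
```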
Trends & challenges
Trends
- Sensor fusion
- ML, DL and RL
- With 5G wireless networks
- Data augmentation
- Personalized data: the historical medical and behavioral data of individuals (El-Bendary et al. (2013) and Namba and Yamada (2018b))
- Use skeletal models for simulation: Mastorakis et al. (2007, 2018) applied a skeletal model simulated in OpenSim
- Fog computing: Intel RealSense includes a 28-nanometer processor
Challenges
- The rarity of real fall data: only Liu et al. (2014) used a data set with nine real falls along with 445 simulated ones.
- Detection in real-time
- Security and privacy
- Platform of sensor fusion
- Limitation of locations: indoor and outdoor environments
- Scalability and flexibility
Paper II: vision
A survey on vision-based fall detection, 2015
Goal
focus on recent vision-based fall detection techniques, including depth-based methods;
discuss the publicly available fall datasets.
Fall datasets
[1] Definition and performance evaluation of a robust svm based fall detection solution.
[2] Multiple cameras fall dataset
For the EDF dataset mentioned in this paper, I cannot find the original paper, except for one in which EDF and OCCU are both introduced. If that is the right paper, the description of EDF in this survey is not entirely accurate.
- | SDUFall | EDF? | OCCU | [1] | [2] |
---|---|---|---|---|---|
camera type | 1 Kinect | 2 Kinects | 2 Kinects | 1 RGB camera | 8 calibrated RGB cameras |
camera viewpoints | 1 | 2 | 2 | NaN | 8 |
fall type | falls with different directions | eight fall directions | occluded falls | falls with different directions | forward, backward falls, falls from sitting down and loss of balance |
#falls | 200 | 320 | 60 | 192 | 200 |
actions in daily life? | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
#scenarios | 1 | 1 | 1 | 4 (home, coffee room, office, lecture room) | 24 |
#subjects | 20 | 10 | 5 | 1 | |
resolution | \(320\times240\) | \(320\times240\) | \(320\times240\) | ||
frame rate | 30 fps | 25 fps | 30 fps | 25 fps | |
year | 2014 | 2008 | 2014 | 2012 | 2010 |
Vision-based fall detectors
Comment: the classification rules for each section are not the same.
Using single RGB camera
Methods are first classified by features; the tasks they address are then mentioned.
- Shape-related features: based on the width-to-height aspect ratio of the person (a minimal sketch of this cue follows this list).
- background separation to get the silhouette, then SVM for classification
- shape deformation from silhouettes, with a shape-deformation-based GMM for classification
- ellipse shape for body modeling and GMM for extracting the moving object
- two HMMs to classify falls and normal activities
- extracting projection histograms of the segmented body silhouette, using them as the feature vector, and completing posture classification with KNN
- Motion pattern
- human motion analysis (analyzing the energy of the motion active area ) + human silhouette shape variations to detect slip-only and fall events
- applying the Integrated Time Motion Image (ITMI) to fall detection; the principal components of ITMIs computed from typical video clips represent a motion pattern
- threshold-based, using the last few frames (whether the person is falling, the magnitude of the fall, the maximum velocity of the fall), etc.
- extract the 3D head trajectory using a single calibrated camera.
- extract foreground human silhouette via background modeling. Ellipse fitting for human body and analyze silhouette motion by an integrated normalized motion energy image.
- Inactivity detection
- use ceiling-mounted, wide-angle cameras with vertically oriented optical axes to reduce the influences of occlusion.
- use learned models of spatial context in conjunction with a tracker
- STHF descriptor by Charfi et al.
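Several of the shape-based methods above boil down to a width-to-height ratio test on the segmented silhouette. A minimal sketch of that cue, assuming a binary silhouette mask is already available; the function name and the threshold value are illustrative, not taken from any of the cited papers.

```python
import numpy as np

def aspect_ratio_fall_cue(silhouette, ratio_thresh=1.0):
    """Toy fall cue from a binary silhouette (H x W array, 1 = person pixels):
    a standing person is taller than wide, so a width-to-height ratio
    above the threshold suggests a lying / fallen posture."""
    ys, xs = np.nonzero(silhouette)
    if len(ys) == 0:
        return False  # no person detected in this frame
    height = ys.max() - ys.min() + 1
    width = xs.max() - xs.min() + 1
    return (width / height) > ratio_thresh
```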
Using multiple cameras
Classified by tasks
- 3D reconstruction
- volume distribution along the vertical axis
- multiple cameras + a hierarchy of fuzzy logic to detect falls. Use voxel person (linguistic summarization of temporal fuzzy inference curves)
- Multiple viewpoint
- Use LHMM (layered hidden Markov model)
- combining decisions from different camera
- using measures of human heights and occupied areas
Using depth cameras
- Using the distance from the top of the person to the floor
- the distance of the human centroid to the floor is smaller than a threshold and the person does not move for a certain number of seconds
- other features like head-to-floor distance, person area and the shape's major-axis length-to-width ratio
- analyzing how a human has moved during the last frames
- characterize the vertical state of a segmented 3D object for each frame, then compute a confidence from features extracted on on-ground events, which are fed into an ensemble of decision trees
- Bayes network for fall detection, with features including duration, total head drop, maximum speed, smallest head height and fraction of frames for which the head drops
- calculating the velocity based on the contraction or expansion of the width, height and depth of the 3D bounding box (a rough sketch of this cue follows this list)
- human body key joints
- 3D body joints at each frame are extracted by randomized decision tree, and the 3D trajectory of the head joint is used to determine whether the fall action has occurred.
- 3D depth for the main detection (structure similarity and vertical height of the person), RGB for regions out of the range of the depth camera (the width-height ratio of the detected human bounding box is used for recognizing different activities)
- an action is represented by a bag of curvature scale space (BoCSS) features of human silhouettes, or as a Fisher Vector over CSS features; alternatively, check whether the width-height ratio of the temporal bounding box is greater than a predefined threshold: if so, it is a fall, otherwise check the 2D velocity and the 3D centroid information
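A rough sketch of the 3D-bounding-box velocity cue mentioned above; the frame rate and speed threshold are illustrative assumptions, not values from the cited work.

```python
import numpy as np

def bbox_expansion_velocity(boxes, fps=30):
    """Velocity cue from a sequence of 3D bounding boxes.
    `boxes` is an (N, 3) array of (width, height, depth) per frame;
    the per-frame change of the box extents, scaled by the frame rate,
    approximates how fast the body is contracting or expanding."""
    diffs = np.diff(boxes, axis=0)                 # per-frame change of each extent
    return np.linalg.norm(diffs, axis=1) * fps

def is_fall(boxes, fps=30, speed_thresh=2.0):
    """Flag a fall when the contraction/expansion speed exceeds a threshold
    (the threshold value here is a placeholder)."""
    return bool(np.any(bbox_expansion_velocity(boxes, fps) > speed_thresh))
```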
Conclusions
Methods with cameras
| method | merits | demerits |
|---|---|---|
| single RGB camera | no requirement of camera calibration; inexpensive | case specific and viewpoint-dependent; occlusion problem |
| calibrated multi-camera systems | viewpoint independent; no occlusion | careful and time-consuming calibration; repeated calibration if any camera is moved |
| depth camera | viewpoint independent; no occlusion | price |

Tips for benchmark
- both falls and activities of daily life (ADL) are required
- include various falls
- include different camera viewpoints: to verify whether the proposed algorithms are viewpoint-independent
- real falls: consider the distribution of volunteers like gender, ages etc.
Tips for fall detection
- combine with other types of data, such as sound
Paper III: 3D depth
3D depth image analysis for indoor fall detection of elderly people, 2015
Previous
Existing fall detection methods
wearable sensor based
small, cheap and wearable, but with high drift
ambient sensor-based
low detection precision; the precision is distance-dependent, so more sensors are required for a bigger room
computer-vision based
more robust and less intrusive.
- methods with a single RGB camera
- 3D-based methods using multiple cameras
- depth camera
Usually shape-related features, human motion analysis and inactivity detection are used as clues for detecting falls.
moment functions are powerful for describing the human shape
- On grey or color images an ellipse is more accurate than a bounding box
- depth camera can offer 3D data so as to work on shape analysis
Threshold based fall detection
- Features that can be used: head-ground distance gap and head-shoulder distance gap, the orientation of the ellipse of the human object and the motion of the human object
- methods: DT, NNs, SVM, Bayesian Belief Network
Goal
deals with fall detection for a single elderly person in home environments
How?
Idea
- extracting the silhouette of the moving individual by subtracting the background frame
- the depth images are converted to a disparity map using horizontal and vertical projection histogram statistics
- coefficients of the human body are calculated to determine the orientation of the individual
- threshold-based: the distance from the centroid of the human body to the floor plane, and the angle between the human body and the floor plane
Method
Data preprocessing
depth data: only adopt Kinect data within the trusted range (1.2-3.8 m)
extracting silhouette of the moving individual: subtracting the median-filtered background depth frame
floor plane: found from the disparity map; the floor plane appears as a noticeably slanted, thick straight line in the disparity map
\(disparity = \frac{f \cdot b}{d}\), where for the Kinect \(f = 580\) pixels, \(b = 7.5\) cm, and \(d\) is the distance between a point in space and the center of the Kinect.
The floor is then expressed as a plane equation \(Ax + By + Cz + D = 0\).
The orientation of the human body: estimate the ellipse of the human body on the image plane, then estimate its orientation
calculate the distance from the centroid of the human body to the floor plane and the angle between the body and the floor plane (a small sketch of this threshold test follows).
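A small sketch of this threshold test, assuming the floor plane coefficients and the body's major axis have already been estimated. Only \(f = 580\) pixels and \(b = 7.5\) cm come from the paper; the threshold values and function names are placeholders.

```python
import numpy as np

F_PIXELS = 580.0   # Kinect focal length in pixels (from the paper)
BASELINE_CM = 7.5  # Kinect baseline in cm (from the paper)

def disparity_to_depth(disparity):
    """Depth of a point from its disparity, using d = f * b / disparity."""
    return F_PIXELS * BASELINE_CM / disparity

def centroid_floor_distance(centroid, plane):
    """Distance from the body centroid (x, y, z) to the floor plane
    A*x + B*y + C*z + D = 0, with plane = (A, B, C, D)."""
    a, b, c, d = plane
    x, y, z = centroid
    return abs(a * x + b * y + c * z + d) / np.linalg.norm([a, b, c])

def body_floor_angle(body_axis, plane):
    """Angle (degrees) between the body's major axis and the floor plane."""
    normal = np.asarray(plane[:3], dtype=float)
    axis = np.asarray(body_axis, dtype=float)
    cos_to_normal = abs(normal @ axis) / (np.linalg.norm(normal) * np.linalg.norm(axis))
    return 90.0 - np.degrees(np.arccos(cos_to_normal))

def is_fall(centroid, body_axis, plane, dist_thresh=40.0, angle_thresh=30.0):
    """Threshold rule in the spirit of the paper: low centroid height and a
    nearly horizontal body axis. Threshold values here are illustrative."""
    return (centroid_floor_distance(centroid, plane) < dist_thresh
            and body_floor_angle(body_axis, plane) < angle_thresh)
```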
Experiments
- The dataset is their own self-captured data; no public download link could be found
- Only the trajectories of two clips are shown; no accuracy figures are reported
Conclusions
- the method may be taken as a candidate
- the experimental results are not convincing.
Paper IV: DL
Deep learning based systems developed for fall detection: a review, 2020
Previous
- the reviewed fall detection systems share some general steps: sensing, data processing, fall event recognition and an emergency alert system to rescue the victim.
Goal
presenting a summary and comparison of existing state-of-the-art deep learning based fall detection systems to facilitate future development in this field.
The categorization focuses on how the different principal methods (CNN, LSTM, and AE) handle the event data captured by sensors.
How?
Methods used for fall detection
CNN based fall detection systems
CNN
- Adhikari et al.: based on video images from an RGB-D camera
- used their own dataset
- poor sensitivity when the user was in crawling, bending or sitting positions; the system also only works in a selected environment and was developed for a single-person scenario
- Li et al: extract human shape deformation by CNNs
- used dataset URFD
- not tested on real dataset
- Yhdego et al.:
- convert the data from accelerometer to image by continuous wavelet transform
- use transfer learning for these images
- use URFD dataset
- Yu et al
- extract human body silhouette by background subtraction. CNN is for preprocessing the silhouette
- codebook background subtraction
- Shen et al.
- cloud
- deep-cut NNs to detect the key points of human body, feed key points into NNs
- Santos et al
- IoT, fog computing
- used three open datasets
- Zhou et al : multi-sensor fusion
- STFT is used for extracting the time-frequency (TF) micro-motion features
- two AlexNets and one SSD (single shot multi-box detector) net are used for classifying the TF features
- He et al.: FD-CNN net
- the collected data is mapped into a 3-channel RGB bitmap image (a hypothetical sketch of such a mapping follows the CNN list below)
- these images, plus the SisFall and MobiFall datasets, are used to train FD-CNN
- Sadreazami et al : CNN for time-series data (radar)
- Sortis et al : raw accelerometer data, CUSUM (cumulative sum) algorithm
- Lu et al: use optical flow images, pretrained 3D CNN on different dataset, transfer learning. Accurate for single-person detection but may not work well on multi-person detection
- Wang et al: tri-axial accelerometer, gyroscope sensor in the smart insole. CNN for improving the accuracy level (it's used directly on the raw sensor data)
- Zhang and Zhu: CNN that works on raw 3-axis accelerometer data streams.
- Camerio et al: multi-stream model, takes high-level handcrafted feature generators. CNN for optical flow, RGB and human estimated pose. Use URFD and FDD datasets for training
- Casilari et al.: CNN for recognizing the pattern from a tri-axial portable accelerometer; multiple datasets used for training. For real-life performance, an LSTM is used
- Espinosa et al.: uses the UP-Fall Detection multi-modal dataset; only the vision-based part.
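A hypothetical sketch of mapping a tri-axial accelerometer window to a 3-channel bitmap, in the spirit of the FD-CNN input above. The exact mapping is not described in these notes, so every detail here (per-axis min-max normalization, resampling, the 32×32 size) is an assumption for illustration only.

```python
import numpy as np

def accel_window_to_rgb(window, size=32):
    """Map a tri-axial accelerometer window (shape (T, 3)) to a 3-channel
    'RGB' bitmap: each axis is min-max normalized to [0, 255], resampled to
    size*size values, and reshaped into one image channel.
    This is a plausible sketch, not the mapping used by FD-CNN."""
    window = np.asarray(window, dtype=float)
    channels = []
    for axis in range(3):
        sig = window[:, axis]
        sig = (sig - sig.min()) / (sig.max() - sig.min() + 1e-8)  # normalize
        # resample to exactly size*size samples, then reshape to a square
        idx = np.linspace(0, len(sig) - 1, size * size)
        resampled = np.interp(idx, np.arange(len(sig)), sig)
        channels.append((resampled.reshape(size, size) * 255).astype(np.uint8))
    return np.stack(channels, axis=-1)  # shape (size, size, 3)
```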
1D CNN
- Cho and Yoon: SVD on accelerometer data for a 1D CNN. Works on tri-axial acceleration data.
- Tsai and Hsu et al.: feature extraction algorithm for converting the depth image to skeleton information . Only seven highlighted feature points are picked from the skeleton joints. Performs well on NTU RGB+D
3D CNN
- Ranhnemoonfar et al.: Kinect depth camera, Adam optimizer. SDUFall dataset for training.
- Li et al.: pre-impact fall detection. Pretrained model + new samples for fine tuning.
- Hwang et al.: 3D-CNN for continuous motion data from depth cameras. Using data augmentation. Using TST fall detection dataset
- Kasturi et al.: Kinect camera; the data fed into the 3D CNN is a stacked cube. Tested on the UR fall detection dataset
FOF CNN
FOF: feedback optical flow convolution
- Hsieh and Jeng : IoT, used Feature Feedback Mechanism Scheme (FFMS) and 3D-CNN. KTH dataset.
LSTM based fall detection systems
LSTM with 3D CNN
- Lu et al.: trained a feature extractor on video kinematic data only; LSTM + spatial visual attention. Uses the Sports-1M dataset and the Multiple Cameras fall dataset.
LSTM with RNN and CNN
- Lie et al.: CNN for extracting 2D skeletons from RGB camera, LSTM for classifying actions. Has an online version
- Abobakr et al.: Kinect RGB-D camera. Convolutional LSTM + ResNet for visual feature extraction, LSTM for sequence modeling and logistic regression for fall detection. Uses the URFD public dataset; can work in real-time
- Xu et al.: IoT; acceleration data from a tri-axial accelerometer as input. Outperforms an SVM+CNN model
- Tao and Yun: body posture and human biomechanics equilibrium, fed depth camera data to RNN+LSTM and extracting 3D skeletons. Then by computing center of mass (COM) positions and the region of base support, falls are detected.
- Ajerla et al.: used edge devices like a laptop for computing, a MetaMotionR sensor (tri-axial accelerometer), and a subset of the MobiAct public dataset for training
- Queralta et al.: edge computing and fog computing. LSTM+RNN
- Luna-Perejon et al.: LSTM+GRU, accelerometer data as input, SisFall dataset for training.
- Torti et al.: Micro-Controller Unit (MCU) with tri-axial accelerometers, LSTM, TensorFlow, SisFall dataset
- Theodoridis et al.: acceleration measurements, RNN, URFD dataset + random 3D rotation augmentation for training. Compares 4 types of models: LSTM-Acc, LSTM-Acc Rot, Acc + SVM-Depth, UFT
- Hsiu-Yu et al.: Kinect, posture types; LSTM compared with CNN; fusion images as input, built from RGB images after extracting the body shape with GMM and optical flow.
LSTM with RCN and RNN
- Ge et al.: Kinect, co-saliency-enhanced RCN; input: video clips; RCN + (RNN+LSTM) to label the output.
AE based fall detection systems
- Jokanovic et al.: time-frequency (TF) analysis. two stacked AE and a softmax layer.
- Droghini et al.: acoustic fall detection, DT (downstream threshold) classifier.
- Zhou and Komuro: VAE + 3D-CNN residual blocks; the reconstruction error is used for detecting fall actions. Datasets: a high-quality simulated fall dataset (HQFD) and the Le2i dataset.
- Seyfioglu et al.: 3 layer-CAE. unsupervised pretraining for the convolutional layers, radar
Conclusion
- Sensors:
- RGB camera: no privacy protection, so it cannot be used in high-risk places such as bathrooms; also vulnerable to hacking
- Depth camera: captures only depth, so it can be used in places like bathrooms
- Accelerometer: cheap, and the captured data are easy to use
- Vision-based methods depend heavily on the background
Paper V: 3D skeletons
Implementation of Fall Detection System Based on 3D Skeleton for Deep Learning Technique
Previous
- Kinect: depth sensor + RGB camera + microphone array
- silhouette normalization: sensitive to the background
- high computational cost
Goal
real-time fall detection
Idea
Methods: seven highlighted feature points + pruned CNNs
- Foreground segmentation: GMM, depth information
- Labeling: use the area size to determine whether the object is human or not.
- Thinning: Use Zhang Suen's rapid thinning method. Dilation + erosion
- Searching: only the joints on the arms and the head are used; the authors argue that when a fall occurs, the joints above the waist change dramatically.
Input: 7 3D joints over 30 frames, i.e. shape (30, 21). Conv1D is used so as to prune the number of parameters (a sketch follows).
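A minimal PyTorch sketch of a Conv1D classifier matching the (30, 21) skeleton input described above. The layer sizes, pooling choices and two-class output are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SkeletonConv1D(nn.Module):
    """1D-CNN fall classifier for 7 selected 3D joints over 30 frames,
    i.e. 21 feature channels per time step."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(21, 32, kernel_size=3, padding=1),  # channels = joint coords
            nn.ReLU(),
            nn.MaxPool1d(2),                              # 30 -> 15 time steps
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                      # global temporal pooling
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                # x: (batch, 21, 30)
        return self.classifier(self.features(x).squeeze(-1))

# Example: a batch of 4 clips, each 30 frames x 21 joint coordinates
logits = SkeletonConv1D()(torch.randn(4, 21, 30))
```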
Experiments
Dataset: NTU RGB+D
Performance
system: 15 frames per second in real-time
Conclusion
- depth image for skeleton extraction, conv1d for parameter pruning
- 15 frames per second for real-time implementation.
Paper VI: 3D skeletons
Human fall-down event detection based on 2D skeletons and deep learning approach
Previous
- Most of the existing skeleton-based action recognition approaches model actions based on well-designed hand-crafted local features.
- depth value+HMM etc.
- single RGB image + NNs (CNNs, RNNs)
Idea
- CNN for extracting skeletons and LSTM for final detection
- Skeleton extractor:
- input: 2D RGB image (1920×1080)
- DeeperCut skeleton extraction: ResNet backbone; outputs 14 joints ("forehead", "chin", left and right "shoulder", "elbow", "wrist", "hip", "knee", "ankle", the mean of the hips, and the mean of the chin and central-hip); the last two joints are included for robustness with respect to the background, so as to build a translation-invariant skeleton model.
- LSTM:
- input: 8 frames of 2D joints, shape (8, 28), where 28 = 2 × 14; supervised training to classify 5 actions (see the sketch below).
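A minimal PyTorch sketch of an LSTM classifier matching the (8, 28) input and 5 action classes described above; the hidden size and single-layer design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    """LSTM action classifier: 8 consecutive frames of 14 2D joints
    (28 values per frame) classified into 5 action classes."""
    def __init__(self, n_joints=14, hidden=64, n_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2 * n_joints, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, 8, 28)
        _, (h_n, _) = self.lstm(x)        # last hidden state summarizes the clip
        return self.head(h_n[-1])

logits = SkeletonLSTM()(torch.randn(2, 8, 28))  # batch of 2 clips
```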
Experiments
- 800 training, 255 validation and 250 test samples. Wrong (incomplete) skeletons are removed manually, aided by a threshold on the sum of the distances between each joint and the centroid.
- trigger rule, applied over the last 30 outputs (a small sketch follows this list):
- if the current output is "lying" and the previous 19 outputs contain more than 14 "danger" statuses, an alarm is triggered
- if the current output is a "safe" status and the previous 8 outputs contain more than 5 "safe" statuses, the alarm signal is reset
- 8 frames per second
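A small sketch of the alarm trigger/reset rule above, assuming the per-frame classifier emits status strings such as "lying", "danger" and "safe"; the class structure and names are illustrative.

```python
from collections import deque

class FallAlarm:
    """Keep a short history of per-frame statuses; trigger the alarm when the
    current status is 'lying' and the previous 19 outputs contain more than
    14 'danger' statuses, and reset it when the current status is 'safe' and
    the previous 8 outputs contain more than 5 'safe' statuses."""
    def __init__(self):
        self.history = deque(maxlen=30)   # last 30 classifier outputs
        self.alarm = False

    def update(self, status):
        prev = list(self.history)
        if status == "lying" and sum(s == "danger" for s in prev[-19:]) > 14:
            self.alarm = True
        elif status == "safe" and sum(s == "safe" for s in prev[-8:]) > 5:
            self.alarm = False
        self.history.append(status)
        return self.alarm
```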
Human Activity Recognition
Paper VII
Paper VII: Vision-based human activity recognition: a survey, 2020
Goal
review and summarize the progress of HAR systems from the computer vision perspective.
HAR
- Related: determining and naming activities using sensory observations
- Goal: labeling the same activity with the same label even when performed by different persons under different conditions or styles
- Methods: usually rely on an activity detection task, which includes temporal identification and localization. Formally, the activity recognition task is divided into classification and detection
- HAR systems are influenced by two technologies: contact-based and remote methods.
- contact-based: require physical interaction of the user with the command acquisition machine or device; sensors like accelerometers, multi-touch screens, etc. Less popular now, because complicated sensors and their price make them hard to accept and wear
- remote-based: vision-based; enjoy societal trust, require nothing of ordinary users and are thus non-intrusive
- Contributions:
- a new update on this rarely surveyed theme
- thorough analysis
- multiple classification schemes used for the analysis: detection, tracking and classification stages; feature extraction process; input data modalities; supervision level; evaluation methods
Benchmarks
- CONVERSE: complex conversational interactions
- ALSAN:
Previous
HAR approaches
HAR approaches according to feature extraction process
The main steps of methods based on handcrafted features:
- foreground detection that corresponds to action segmentation
- feature selection and extraction by an expert
- classification of action represented by the extracted features
The spatial and temporal representations of actions
Based on spatial cues (spatial representation )
- body models: use a kinematic joint model; this can be a 3D model, or recognition can be done directly on a 2D model
- image models: holistic representation of actions that use a regular grid bounded by ROI centered around the person. E.g., silhouettes, contours, motion history images, motion energy images and optical flow.
- spatial statistics: use a set of statistics of local features from the surrounding regions. E.g., calculated spatial-temporal interest points (STIP) of the image and assign each region to a set of features.
- volume-based: rely on features like texture, color, posture, etc.; recognize actions by the similarity between two action volumes. Only works well on simple actions or gestures. E.g., SIFT.
- trajectory-based: represent joint positions with 2D or 3D points and then track them; the tracked changes in posture are used for classification. Robust to noise, view and/or illumination changes. E.g., HOG, HOF.
Based on temporal cues (temporal representation )
- action grammars: represent the action as a sequence of moments, where each moment is described by its own appearance and dynamics. Features are grouped into similar configurations called states, and temporal transitions between these states are learned. E.g., HMM, CRF, regression models and context-free grammars
- action templates: represent the appearance and dynamics of temporal blocks of features, called templates, taking representations of dynamics from several frames. Methods like the Fourier transform, wavelet representations and trajectories of body parts can be used as templates
- temporal statistics: use statistical models to describe the distributions of unstructured features over time.
Appearance based approaches: can be classified according to either shape or motion based characteristics.
Use 2D or 3D depth images and are based on shape features, motion features or any combination of both features. One advance is the skeleton-based recognition. Methods of it can be classified as (1) joint locations: consider the skeleton as a set of points; (2) joint angles: assume the human body as a system of rigid connected segments and the movement as an evolution of their spatial configuration.
- shape based methods: local shape features such as contour points, local region, silhouette and geometric features from the Human image or video after foreground segmentation.
- motion based methods: optical flow and motion history volumes for extracting action representations; methods like vector quantization of motion descriptors are used for action recognition, with histograms of optical flow and bag-of-words classifiers
- hybrid methods: shape + motion features
Feature learning based methods for extracting representations
- traditional approaches
- dictionary learning: provides a sparse representation of the input data. It has been shown that the use of over-complete dictionaries can produce even more compact representations
- genetic programming: search a space of possible solutions without having any prior knowledge and can discover functional relationships between features in data enabling its classification. GP is used to construct holistic descriptors that allow to maximize the performance of action recognition tasks.
- bayesian networks: PGMs, some use PGMs to represent and capture the semantic relationships among action units, and the correlations of the action unit intensities
- deep-learning-based methods
- generative methods: the main goal of these models is to understand the data distribution including the features that belong to each class. Methods like AE, VAE, GANs
- discriminative methods: DNN,RNN, CNN. E.g., propose long-term temporal convolutions + high-quality optical flow
- hybrid models
HAR approaches according to the recognition stages
- First stage methods (detection)
- skin color: used to detect the desired body part. May face problems depending on the chosen color space, or when objects in the scene have colors close to that of skin; not very robust
- shape: the contours of the body part shape. These methods are independent of the camera view, skin color and lighting conditions
- pixel values: appearance is represented as changes of pixel values between the images of a sequence, according to the activity
- 3D models: build matches between characteristics of the model, based on various features of the images. These methods are independent of the viewpoint
- motion: using, e.g., the difference in brightness of pixels between two successive images
- anisotropic diffusion: based on extending the successful anisotropic-diffusion-based segmentation to the whole video sequence to improve detection
- Second stage methods (tracking)
temporal based methods: models are used to follow body parts
- feature tracking based on correlation: track the regions that contain body parts. These models require that the part being tracked remain in the same neighborhood across successive images. E.g., using 3D information from depth maps, or optical flow
- contour-based tracking: snakes. A contour is initially placed close to the ROI (region of interest), then warped iteratively using active shape models in each frame to make the snake converge. Sensitive to color intensity variations and to smoothing and softening of contours
optimal estimation
- estimate the state of moving systems from a series of measurements. In HAR they are used to estimate the movements of the human body, in real-time systems. Limited against cluttered backgrounds
- similar to a Kalman filter (KF)
particle filters
following the body parts and their configurations in complex environments. The location of one body part is modelled by a set of particles
cam shift
based on the mean shift algorithm (it uses density-based appearance models to represent the targets).
- Third stage methods (classification)
- SVM: used with kernel like Gaussian, RBF or linear kernel
- Naive Bayesian classifier:
- KNN: sensitive to the local structure of the data
- K-means: sensitive to the data structure.
- mean shift clustering: no requirement of prior knowledge about the number of clusters and no limit of the forms
- finite state machines: states (static gestures and postures) and transitions (temporal and/or probabilistic constraints). However, the model needs to be changed every time a new gesture appears. Computationally expensive.
- HMMs:
- dynamic time warping: calculates the distances between each pair of possible points from two signals. Used for estimating and detecting movement in a video sequence (a small sketch follows this list)
- NNs
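A plain dynamic time warping sketch to make the pairwise-distance idea concrete; pure NumPy, 1D signals only, written for clarity rather than speed.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping between two 1D sequences: fill a cost matrix
    where each cell holds the distance of the aligned pair plus the cheapest
    way to reach that alignment."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # advance in a only
                                 cost[i, j - 1],      # advance in b only
                                 cost[i - 1, j - 1])  # advance in both
    return cost[n, m]

# Example: similar motion signals at different speeds still align cheaply
print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 0, 1, 2, 3, 3, 2, 1]))
```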
HAR approaches according to the source of the input data
- Uni-modal methods
- space-time methods: use time and the 3D representation of the body to locate activities in space. Sensitive to noise and occlusion; cannot handle complex actions
- stochastic methods: models like Markov models. Training is difficult because of the number of parameters.
- rule-based methods: characterize the activity using a set of rules or attributes
- shape-based methods: shape features to represent and recognize activities. Dependent on the viewpoint, occlusion and people's clothing, and sensitive to lighting variations
- Multi-modal methods
- emotional methods: associate visual and textual features
- behavior methods: recognize behaviors of individuals.
- social networks-based methods: allow recognition of social events and interactions
HAR approaches according to the ML supervision level
- supervised: mostly used to classify and recognize short term actions
- unsupervised: better at finding spatio-temporal patterns of motion, but computationally complex, less accurate and less trustworthy.
- semi-supervised: hybrid.
Activities type
Divided by the complexity
- elementary human actions: simple atomic activities
- gestures: a language or part of the non-verbal communication which can be employed to express significant ideas or orders
- behaviors: a set of physical actions and reaction of individuals in specific situations
- interactions: reciprocal actions or exchanges between two entities or more
- group actions:
- events: social actions between individuals
Body parts
- hand gestures: be tracked to detect the communication between individuals
- feet: detect shifting and movements of people, or other actions
- facial expressions: interpret specific types of human activities, especially for handicapped or disabled people
- full body: postures and human actions performed with the whole body
The Types of Input Data
image or videos
- HAR on static images: when an activity is distinguishable from others by its characteristic appearance
- HAR on videos: offers extra information about what happens before and after an event, so that the relation between two successive frames can be established
single viewpoint or multi-view acquisition
- single view acquisition
- multiple view acquisition
Evaluations
Validation means
- One platform, validated on different types of datasets (differing in actions) acquired with this platform
- Different platforms and different datasets, all evaluated so as to test generalization across datasets
- Compared with results in literature
Datasets (Benchmarks)
Evaluation metrics (a small sketch computing them follows this list)
sensitivity: also known as the TPR (true positive rate), recall or probability of detection; \((1-Sensitivity)\) measures the failure of the system to detect actions
\(Sensitivity =\frac{TP}{TP+FN}\)
precision: also known as PPV (positive prediction value). \((1-precision)\) determines the probability of the recognizer incorrectly identifying a detected activity
\(Precision=\frac{TP}{TP+FP}\)
Specificity: also known as the true negative rate (TNR); \((1-Specificity)\) is the false positive rate (FPR).
\(Specificity=\frac{TN}{TN+FP}\)
Negative predictive value (NPV): also known as negative precision. It measures the system sensitivity to negative class.
\(NPV=\frac{TN}{TN+FN}\)
F_measure: the harmonic mean of precision and recall. It gives information about the test's accuracy; best at 1 and worst at 0.
\(F\_measure = \frac{2 \cdot Precision \cdot Sensitivity}{Precision + Sensitivity}\)
Accuracy: the percentage of correct predictions relative to the total number of samples
\(Accuracy =\frac{\#CorrectPredictions}{\#Predictions}\\Accuracy =\frac{TP+TN}{\#samples}\)
Likelihood ratio: the likelihood of an activity predicted when it matches the ground truth compared to the likelihood when it's predicted wrongly.
\(LR+=\frac{Sensitivity}{1-Specificity},\\ LR-=\frac{1-Sensitivity}{Specificity}\)
AUC (area under the ROC curve): equals 1 for a perfect predictor.
Confusion matrix
IoU: intersection over Union, also known as Jaccard index or Jaccard similarity coefficient.
\(IoU=\frac{AreaOfOverlap}{AreaOfUnion}\)
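A small sketch computing the metrics above from raw confusion-matrix counts; the example counts at the end are made up for illustration.

```python
def har_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics listed above from raw counts."""
    sensitivity = tp / (tp + fn)                    # recall / TPR
    precision   = tp / (tp + fp)                    # PPV
    specificity = tn / (tn + fp)                    # TNR
    npv         = tn / (tn + fn)                    # negative precision
    f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    lr_pos      = sensitivity / (1 - specificity)   # LR+
    lr_neg      = (1 - sensitivity) / specificity   # LR-
    return dict(sensitivity=sensitivity, precision=precision,
                specificity=specificity, npv=npv, f_measure=f_measure,
                accuracy=accuracy, lr_pos=lr_pos, lr_neg=lr_neg)

print(har_metrics(tp=40, fp=5, tn=50, fn=5))
```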
Limitations and Challenges
Limitations
Various issues may affect the effectiveness of a HAR system:
- specific to the methods used during the various phases of the recognition process:
- methods based on the form or the appearance
- e.g., colorimetric segmentation: may confuse scene objects with body parts
- variation in appearance or clothing of people
- related to the acquisition devices, experimentation environment or various applications of the system.
- light variations: affect the image quality and then features
- perspective change: a big problem if the data is acquired from a single viewpoint
- self-occlusion: body parts occlude each other
- occlusion of another object
- partial occlusion of human body parts
- variety of gestures linked to the complex structure of human activities and the similarity between classes of different actions.
- due to data association
- noise, complex or moving backgrounds and unstructured scenes, and scale variation
- rely on their own recorded dataset to test performance
- call for a common benchmark
Challenges
- the requirement of continuous monitoring and of generating reliable answers at the right time
- modeling and analyzing interactions between people and objects with an appropriate level of accuracy
- societal challenges: acceptance by the society, privacy, side effects of installation
- privacy: HAR systems on smartphones may be a way out
- HAR systems should be independent of users' age, color, size or ability to use them
- gestures independence and gestures spotting from continuous data streams
- detect and recognize various gestures under different background conditions
- tolerant with the scalability and growth of gestures
- context-aware: improve applications in its domain
- daily life activities: the videos are complex and hard to model; the start and end times of particular activities overlap; discrimination between intentional and involuntary actions is needed
- HAR through missed parts of video, recognition of more than one activity performed by one person at the same time, early recognition and prediction of actions
- The implementation of DL HAR systems: memory constraints, a high number of parameter updates, collection and fusion of large, varied multi-modal data for the training process, deployment of different architectures of DL-based methods on smartphones or wearable devices