
Fall Detection using Smartphone Audio Features

Michael Cheffena

Abstract—An automated fall detection system based on smartphone audio features is developed. The spectrogram, mel frequency cepstral coefficients (MFCCs), linear predictive coding (LPC) and matching pursuit (MP) features of different fall and no-fall sound events are extracted from experimental data. Based on the extracted audio features, four different machine learning classifiers, the k-nearest neighbor classifier (k-NN), support vector machine (SVM), least squares method (LSM) and artificial neural network (ANN), are investigated for distinguishing between fall and no-fall events. For each audio feature, the performance of each classifier in terms of sensitivity, specificity, accuracy and computational complexity is evaluated. The best performance is achieved using spectrogram features with the ANN classifier, with sensitivity, specificity and accuracy all above 98%. The classifier also has acceptable computational requirements for training and testing. The system is applicable in home environments where the phone is placed in the vicinity of the user.

Index Terms—Fall detection, smartphone, audio features, e-health, assisted living.

I. INTRODUCTION

As the world's aging population grows, increasing attention has been given in recent years to developing health-enabling technologies and ambulatory monitoring systems for the elderly [1]. According to the World Health Organization, approximately 28-35% of people aged 65 and above fall each year, increasing to 32-42% for those above 70 years of age [2]. In fact, falls are known to increase exponentially with age-related biological changes in the body, leading to a high incidence of falls and fall-related injuries in the aging society [3]. Developing a robust fall detection system which helps alleviate this major health issue is thus an important research topic, beneficial to society in general and the elderly in particular.

The main objective of a fall detection system is to quickly raise an alert when a fall event occurs, in order to reduce the time spent lying on the ground after a fall. This is important as long lies without assistance may lead to hypothermia, dehydration, bronchopneumonia and pressure sores [4], [5]. Besides enabling rapid assistance after a fall, such a system also has a direct impact on reducing the fear of falling, which in turn increases the risk of falling [6]. The fear of falling by itself has been shown to significantly decrease the quality of life of an individual, leading to less physical activity, falls, depression and decreased social contact [7].

In recent years, a number of fall detectors have been reported in the literature; they can be broadly categorized into context-aware systems and wearable devices [3]. In context-aware systems, sensors (such as cameras, floor sensors, infrared sensors, microphones and pressure sensors) are deployed in the environment to detect falls, and their main advantage is that the subject is not required to wear them [8]–[11]. However, their performance is limited to the environments where the sensors are previously deployed [9].

For systems based on wearable devices, the subject is required to wear sensors under or on top of clothing. Most of these systems are based on accelerometer sensors [12]–[15], with some of them also incorporating gyroscope sensors [16]. Recently, mobile phone based fall detection systems have been emerging [17]–[23], all of them utilizing the phone's built-in accelerometer and/or gyroscope sensor. They require users (e.g., the elderly) to wear their mobile phones at a specific body location (chest, waist, thigh or back) to be able to detect a fall. However, in home environments, mobile phones are usually not worn on the body but placed in the vicinity of the user. In these cases, the aforementioned mobile phone based fall detection systems will not work.

In this study, we propose a fall detection system based on smartphone audio features. The system is applicable in home environments where the phone is placed in the vicinity (within a distance of around 5 meters) of the user. Four different machine learning classifiers are investigated using various audio features extracted from experimental data of fall and no-fall sound events.

The paper begins in Section II by discussing the various audio features considered for detecting a fall event. In Section III, four different machine learning classifiers are presented for discriminating between fall and no-fall events based on the extracted audio features. Experimental data and a performance analysis investigating which audio feature and classifier are best suited for detecting a fall using a smartphone audio signal are presented in Section IV. Limitations of the proposed smartphone based audio fall detection system are discussed in Section V. Finally, conclusions are given in Section VI.

II. FALL AND NO-FALL AUDIO FEATURES

An automatic smartphone based audio fall detection system must use signal features that are likely to result in effective discrimination between falls and other no-fall daily activities. Generally, the input audio signal is unstructured and is composed of contributions from a variety of different sources. Thus, specific audio features within the input acoustic signal need to be extracted for detecting a fall event. The appropriate choice of these features is vital for developing a robust fall detection system. In the following, audio features such as the spectrogram, mel frequency cepstral coefficients (MFCCs), linear predictive coding (LPC) and matching pursuit (MP) are considered for detecting a fall event. These features are extracted from experimental data of fall and no-fall sound events collected using smartphones at a sampling frequency of 44.1 kHz; see Section IV for details on the measurement campaign.


A. Spectrogram

A spectrogram is a time-frequency intensity representation of a sound signal. It has been widely used in sound recognition and classification [24], and the spectrogram features of an audio signal might be used for detecting a fall. The features are extracted by first dividing the signal into overlapping short time segments, enabling greater visibility of the frequency changes with time. A window function is applied to each segment to minimize the boundary effects due to segmentation. A typical window used for this purpose is the Hamming window [25], given by

w(n) = 0.54 − 0.46 cos(2πn/N)   (1)

where n is the window time and N = 2^a is the window size for an integer a. A short window size provides better time resolution, whereas a longer one results in improved frequency resolution. The relationship between the amount of overlap and the window size determines the computational time and the time-frequency resolution of the resulting spectrogram. In our case, various window sizes and overlap percentages were considered with regard to accuracy and the computational complexity relevant for real-time applications. A window size of N = 1024 and 32 percent overlap were found to provide a reasonably good estimate of the changes in the audio signal with acceptable computation time. Then, a fast Fourier transform (FFT) is applied to each time window as

X(h) = Σ_{n=0}^{N−1} w(n) x(n) exp(−j2πhn/N)   (2)

where x(n) is the input audio signal. Parameter h = 0, 1, ..., N − 1 corresponds to the frequency fr(h) = h·fs/N, where fs is the sampling frequency in Hertz, equal to 44.1 kHz in our case. Finally, to approximately partition the range of X(h) into a set of bins, each X(h) value is transformed into power energy using

P(h) = 10 log10(|X(h)|²)   (3)

For each windowed segment of the audio signal, a set of 1 × (N/2 + 1) (i.e., 1 × 513) P(h) feature values is generated (note that the spectrum is symmetric around N/2).
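To make the preceding steps concrete, the following is a minimal NumPy sketch of the spectrogram feature extraction described above (Hamming window of N = 1024 samples, 32 percent overlap, one-sided FFT and power in dB). It is an illustration rather than the author's MATLAB implementation; the function and variable names and the exact hop-size convention are assumptions.

```python
import numpy as np

def spectrogram_features(x, N=1024, overlap=0.32):
    """Power spectrogram (dB) of a mono signal; one column of N/2 + 1 values per frame."""
    hop = int(N * (1 - overlap))      # frame advance; exact hop convention is an assumption
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / N)        # Hamming window, Eq. (1)
    n_frames = 1 + max(0, (len(x) - N) // hop)
    feats = np.empty((N // 2 + 1, n_frames))           # 513 features per frame for N = 1024
    for i in range(n_frames):
        seg = x[i * hop : i * hop + N] * w
        X = np.fft.rfft(seg, n=N)                      # one-sided FFT, Eq. (2)
        feats[:, i] = 10.0 * np.log10(np.abs(X) ** 2 + 1e-12)   # power in dB, Eq. (3)
    return feats

# Example: a 4 s signal sampled at 44.1 kHz (white noise stands in for a recorded event)
fs = 44100
x = np.random.randn(4 * fs)
print(spectrogram_features(x).shape)
```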

B. Mel Frequency Cepstral Coefficients (MFCCs)

MFCCs are features widely used for speech and sound recognition. They describe the sound signal's energy distribution in the frequency domain based on the mel frequency scale [26] and might be used for detecting a fall. They are often considered best for speech recognition as they accord with human hearing characteristics and have good anti-noise ability [27]. They are computed in six steps [8]:
1) Preemphasis: The acoustic signal is first preemphasised using a first-order high-pass filter with preemphasis coefficient η to compensate for the attenuation of the high-frequency signal components. The filter is defined as

y(m) = x(m) − η x(m − 1)   (4)

where m is the time index, x(m) is the input signal, y(m) is the filter output signal and η ∈ [0.95, 1]. In this study we used η = 0.96.
2) Segmentation and Windowing: The audio signal is segmented into a number of overlapping frames and a Hamming window is used to reduce the boundary effects caused by segmentation. A window size of 1024 with 32 percent overlap is also used here, as it provides a good estimate of the MFCC features with reasonable computation time.
3) FFT: For each frame, an FFT is applied to obtain the frequency spectral features of the audio signal.
4) Mapping and Filtering: The mel-scale mapping is applied to each windowed frame using [28]

Mel(fm) = 1125 ln(1 + fm/700)   (5)

where fm is the frequency in the range from 0 to 22 kHz. The FFT values are multiplied by a bank of 30 triangular filters uniformly spaced on the mel scale to produce the filter-bank energies El, for l = 1, 2, ..., 30.
5) Discrete cosine transform (DCT): A DCT is applied to the filter-bank energies to create J MFCC features for frame i, i.e.,

Cj,i = Σ_{l=1}^{L} cos((j + 1)(l − 0.5)π/L) El   (6)

where L = 30 is the number of triangular filters used in step 4 and j = 1, 2, ..., J. From our data, J = 13 is found to be optimal, as no significant increase in accuracy is observed with an increasing number of coefficients.
6) Steps 4 and 5 are repeated for all frames to obtain the MFCC feature matrix of the sound signal.
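As an illustration of the six steps above, the sketch below computes an MFCC feature matrix with NumPy. The triangular filter-bank construction (band edges between 0 Hz and fs/2) and the hop-size convention are assumptions not spelled out in the text; the DCT follows Eq. (6) as written.

```python
import numpy as np

def mel(f):                          # Eq. (5)
    return 1125.0 * np.log(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

def mfcc_features(x, fs=44100, N=1024, overlap=0.32, L=30, J=13, eta=0.96):
    x = np.append(x[0], x[1:] - eta * x[:-1])          # 1) preemphasis, Eq. (4)
    hop = int(N * (1 - overlap))                       # 2) framing (hop is an assumption)
    w = np.hamming(N)
    n_frames = 1 + max(0, (len(x) - N) // hop)
    # Triangular filters uniformly spaced on the mel scale (band edges are assumptions)
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), L + 2))
    bins = np.floor((N + 1) * edges / fs).astype(int)
    fbank = np.zeros((L, N // 2 + 1))
    for l in range(1, L + 1):
        lo, mid, hi = bins[l - 1], bins[l], bins[l + 1]
        for k in range(lo, mid):
            fbank[l - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[l - 1, k] = (hi - k) / max(hi - mid, 1)
    l_idx = np.arange(1, L + 1)
    C = np.empty((J, n_frames))
    for i in range(n_frames):
        seg = x[i * hop : i * hop + N] * w
        spectrum = np.abs(np.fft.rfft(seg, n=N)) ** 2  # 3) FFT power spectrum
        E = fbank @ spectrum                           # 4) filter-bank energies E_l
        for j in range(1, J + 1):                      # 5) DCT of Eq. (6)
            C[j - 1, i] = np.sum(np.cos((j + 1) * (l_idx - 0.5) * np.pi / L) * E)
    return C                                           # 6) J x n_frames feature matrix
```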

C. Linear Predictive Coding (LPC)

LPC features have been widely used in speech recognition [25], [29]. The main idea of LPC is that a given sound signal sample at time m, x̂(m), can be estimated as a linear combination of the past Ps samples as [30]

x̂(m) = Σ_{p=1}^{Ps} ap x(m − p)   (7)

where ap is the pth LPC coefficient and Ps is the LPC order. The basic steps for extracting the LPC features of a sound signal, as described in [30], are:
1) Preemphasis: The sound signal is first preemphasised using the first-order high-pass filter given in (4) to compensate for the attenuation of the high-frequency signal components (similar to step 1 of the MFCC feature extraction discussed in Section II-B).
2) Segmentation and Windowing: The audio signal is segmented into a number of overlapping frames. Good estimates of the audio features with reasonable computation time are achieved using a window size of 1024 with 32 percent overlap, as used for the spectrogram and MFCC features. The Hamming window defined in (1) is used to minimize the boundary effects caused by segmentation.
3) Auto-correlation analysis: Each frame of the windowed signal is auto-correlated as

ri(p) = Σ_{n=0}^{N−1−p} x̃i(n) x̃i(n + p)   (8)

where p = 0, 1, ..., Ps, and n and N are the window time and size, respectively, as defined in (1). Parameters x̃i(n) and x̃i(n + p) are the segmented and windowed audio signal of frame i at times n and n + p, respectively.
4) LPC parameters: The auto-correlation coefficients of each frame are then converted into LPC parameters by solving

Σ_{p=1}^{Ps} ap r(|p − v|) = r(v),  1 ≤ v ≤ Ps   (9)

where the LPC coefficients ap can be obtained using Durbin's recursion technique [31]; see [30] for more details on the algorithm. In our analysis, the LPC filter order is set to Ps = 48, which is consistent with the recommendation of Markel and Gray [32], where the number of coefficients is suggested to be equal to the sampling rate in kHz (44.1 kHz in our case) plus 4 or 5 additional coefficients.
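The following sketch computes the LPC coefficients of one windowed frame from its auto-correlation sequence via the Levinson-Durbin recursion, one standard way of solving (9). The paper only states that Durbin's recursion is used, so the exact implementation details and names here are assumptions.

```python
import numpy as np

def lpc_coefficients(frame, order=48):
    """LPC coefficients a_p of Eq. (7) for one windowed frame, via Eqs. (8)-(9)."""
    N = len(frame)
    r = np.array([np.dot(frame[:N - p], frame[p:]) for p in range(order + 1)])  # Eq. (8)
    if r[0] == 0.0:
        return np.zeros(order)             # silent frame: no prediction possible
    a = np.zeros(order + 1)                # a[0] is the leading 1 of the error filter
    err = r[0]
    for i in range(1, order + 1):          # Levinson-Durbin recursion
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                     # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a, err = a_new, err * (1.0 - k * k)
    return -a[1:]                          # predictor coefficients of Eq. (7)

# Example: 48 LPC features for a single Hamming-windowed 1024-sample frame
frame = np.hamming(1024) * np.random.randn(1024)
print(lpc_coefficients(frame).shape)
```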

D. Matching Pursuit (MP)

The MP algorithm was originally introduced by Mallat and Zhang [33] for decomposing signals over an over-complete dictionary of functions. It has been used in different applications such as video coding [34], music note detection [35] and environmental sound recognition [36]. The algorithm provides a way to select a small set of basis vectors that produces meaningful audio features which might, e.g., be used for detecting a fall. In the following, the description of the MP algorithm is given based on [36]. Consider a dictionary DMP, which is a collection of parametrized waveforms expressed as

DMP = {ϕγ : γ ∈ Γ}   (10)

where ϕγ is called an atom and Γ is a parameter set. The approximate decomposition of a signal s is obtained by

s = Σ_{b=1}^{B} αγb ϕγb + R(B)   (11)

where R(B) is the residual. The goal is to find the indices γb and compute the coefficients αγb while minimizing R(B). The algorithm builds up a sequence of sparse approximations stepwise, starting with the initial approximation s(0) = 0 and residual R(0) = s. First, all inner products of the signal s with the atoms in the dictionary DMP are computed. The atom with the largest inner product magnitude, ϕγ0, is selected as the first element and is then subtracted from s to yield a new residual. At each stage z = 1, 2, ..., the algorithm finds the atom with the highest correlation with the residual and adds the scalar multiple of that atom to the current approximation as

s(z) = s(z−1) + αγz ϕγz   (12)

where αγz = ⟨R(z−1), ϕγz⟩ and R(z) = s − s(z). After B steps, a representation of the approximate decomposition is obtained as shown in (11). As in [36], we used a dictionary of Gabor atoms in order to encapsulate the non-stationary characteristics of the sound signals. The discrete Gabor time-frequency atom is expressed as

G(m) = (K/√s) exp(−π(m − u)²/s²) cos(2πω(m − u) + θ)   (13)

where s ∈ ℜ+; u, ω ∈ ℜ; and θ ∈ [0, 2π]. Parameter K is a normalization factor such that ∥G∥² = 1. Parameters s, u, ω and θ correspond to an atom's position in scale, time, frequency and phase, respectively. The Gabor dictionary in [33] was implemented with atom parameters obtained from dyadic sequences, s = 2^pg (for 1 ≤ pg ≤ B); see [33] for details on the algorithm. We used a window size of 1024 with 32 percent overlap and decomposed each 1024-point segment using MP with a dictionary of Gabor atoms that are also 1024 points in length. The parameters of the Gabor function used are s = 2^pg (for 1 ≤ pg ≤ 8), u = {0, 64, 128, 192}, ω = K e^2.6 (for 1 ≤ e ≤ 35), K = 0.5 × 35^−2.6 and θ = 0. We try to keep the dictionary size as small as possible without affecting the accuracy, since a large dictionary demands higher computational complexity, which may not be suitable for real-time applications. Based on our experimental data (discussed in Section IV), no noticeable improvement in accuracy is observed by further increasing the above Gabor parameter values.
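A compact illustration of the greedy MP decomposition with unit-norm Gabor atoms is sketched below. The dictionary grid is deliberately smaller than the one used in the paper, and the normalized-frequency values are assumptions made only to keep the example self-contained.

```python
import numpy as np

def gabor_atom(N, s, u, omega, theta=0.0):
    """Discrete Gabor atom of Eq. (13), scaled to unit energy."""
    m = np.arange(N)
    g = np.exp(-np.pi * (m - u) ** 2 / s ** 2) * np.cos(2 * np.pi * omega * (m - u) + theta)
    norm = np.linalg.norm(g)
    return g / norm if norm > 0 else g

def matching_pursuit(x, dictionary, n_atoms=5):
    """Greedy MP: repeatedly pick the atom best correlated with the residual, Eqs. (11)-(12)."""
    residual = x.astype(float).copy()
    indices, coeffs = [], []
    for _ in range(n_atoms):
        inner = dictionary @ residual      # inner products with all atoms
        b = int(np.argmax(np.abs(inner)))
        indices.append(b)
        coeffs.append(inner[b])
        residual = residual - inner[b] * dictionary[b]
    return np.array(indices), np.array(coeffs), residual

# Small illustrative dictionary over one 1024-point segment (parameter grid is an assumption)
N = 1024
atoms = [gabor_atom(N, s=2.0 ** p, u=u, omega=om)
         for p in range(1, 9) for u in (0, 64, 128, 192) for om in np.linspace(0.01, 0.5, 8)]
D = np.vstack(atoms)                       # each row is a unit-norm atom
segment = np.random.randn(N)               # placeholder for one windowed audio frame
idx, alpha, r = matching_pursuit(segment, D)
```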

III. CLASSIFIERS

Four different machine learning classifiers are considered for distinguishing between fall and no-fall events based on the extracted audio features discussed in Section II. The audio features are represented by an R × S matrix, where R is the feature length within each frame and S is the number of frames in a given sound signal. The data length is 4 s × 44.1 kHz = 176400 points. Using a window size of 1024 points with 32 percent overlap, the number of frames in a sound signal is [(176400 − 0.32 × 1024)/(0.32 × 1024)] = 537. Thus, for the audio features discussed in Section II, R = 513 and S = 537 for the spectrogram; R = 13 and S = 537 for MFCCs; R = 48 and S = 537 for LPC; and R = 1024 and S = 537 for MP. All the classifiers are implemented in the MATLAB environment. A brief description of each classifier is given in this section.

A. k-Nearest Neighbor Classifier (k-NN)

This method classifies an event based on the closest k training events [37]. For k = 1, the classifier assigns an unknown event to the event class that has a sample closest to it. The Euclidean distance of the sample features can be used to measure the closeness of an unknown event to the classes of events. In our case, the Frobenius norm is used as a measure of the distance between pairs of audio feature matrices as [8]

dO−Q = √( Σ_{r=1}^{R} Σ_{s=1}^{S} (Ors − Qrs)² )   (14)
where O and Q are the feature matrices of two sound events, both of dimension R × S. For simplicity, the k = 1 classifier is used in our case. For each feature type (i.e., spectrogram, MFCCs, LPC and MP), the minimum Frobenius norms between a test feature matrix and the training no-fall and fall feature matrices are first computed using (14), designated as min(dtest−nofall,training) and min(dtest−fall,training), respectively. The hypothesis H1 (fall) is then chosen if

min(dtest−nofall,training) / min(dtest−fall,training) > 1   (15)

Otherwise, the null hypothesis H0 (no-fall) is chosen.
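A direct sketch of this 1-NN decision rule is given below; the feature matrices are compared with the Frobenius norm of Eq. (14), and the ratio test of Eq. (15) reduces to comparing the two minimum distances. Names and the placeholder data are illustrative assumptions.

```python
import numpy as np

def frobenius_distance(O, Q):
    """Eq. (14): Frobenius norm of the difference of two R x S feature matrices."""
    return np.sqrt(np.sum((O - Q) ** 2))

def knn_fall_decision(test_feat, train_fall, train_nofall):
    """Returns True for H1 (fall), False for H0 (no-fall), following Eq. (15)."""
    d_fall = min(frobenius_distance(test_feat, F) for F in train_fall)
    d_nofall = min(frobenius_distance(test_feat, F) for F in train_nofall)
    return d_nofall > d_fall    # equivalent to the ratio in Eq. (15) being greater than 1

# Example with MFCC-sized feature matrices (13 x 537), random placeholders
train_fall = [np.random.randn(13, 537) for _ in range(5)]
train_nofall = [np.random.randn(13, 537) for _ in range(5)]
print(knn_fall_decision(np.random.randn(13, 537), train_fall, train_nofall))
```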

B. Support Vector Machine (SVM)

SVMs are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis [38]. They are based on the concept of decision planes, which define the decision boundaries between classes based on the training data. Given some training data D of fall and no-fall sound events, i.e., a set of g points of the form

D = {(xd, yd) | xd ∈ ℜ, yd ∈ {−1, 1}}   (16)

where xd (for d = 1, ..., g) is a vector of feature observations (i.e., the R × S audio feature matrix discussed above is transformed into a column vector xd of observations, with g being the feature length) and yd is either 1 or −1, specifying the class of the observations (fall or no-fall). Hyperplanes linearly separating the two classes can then be defined as w^T xd + b ≥ 1 for yd = 1 and w^T xd + b ≤ −1 for yd = −1, where w is the weighting vector and b is the bias [39]. The goal is to maximize the distance between these two hyperplanes by minimizing ∥w∥, which can be formulated as the quadratic optimization problem

min_w ∥w∥ subject to yd(w^T xd + b) ≥ 1 for d = 1, ..., g   (17)

The discriminant function can then separate the two categories (fall and no-fall) as

f(xd) = sign(w^T xd + b)   (18)

where f(xd) = +1 for all xd above the boundary (i.e., when yd = 1) and f(xd) = −1 for all xd below the boundary (i.e., when yd = −1).
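The paper trains the SVM in MATLAB; purely as an illustration of the linear decision rule of Eqs. (16)-(18), the sketch below uses scikit-learn's LinearSVC on flattened feature vectors. The use of scikit-learn and the placeholder data shapes are assumptions, not part of the original system.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Flattened R x S feature matrices, one row per sound event; labels +1 (fall) / -1 (no-fall)
X_train = np.random.randn(40, 13 * 537)            # placeholder MFCC-sized features
y_train = np.where(np.random.rand(40) > 0.5, 1, -1)
X_test = np.random.randn(5, 13 * 537)

clf = LinearSVC()                                   # learns w and b of f(x) = sign(w.x + b)
clf.fit(X_train, y_train)
print(clf.predict(X_test))                          # +1 -> fall, -1 -> no-fall
```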

C. The Least Squares Method (LSM)

In this method, the different audio feature matrices are first transformed into column vectors of observations. Then, for each feature type, a training data-set is used to create two average feature reference vectors corresponding to fall and no-fall events (calculated by taking the average of the fall and no-fall event features). The sum of the squared differences between a test vector t = [t1, ..., tg] (with g being the feature length) and each of the reference vectors r = [rf1, ..., rfg], for f = 1, 2, corresponding to the fall and no-fall events, is calculated as [37]

ϵf² = Σ_{d=1}^{g} (td − rfd)²   (19)

For a given feature type, a decision is made based on which reference feature vector (corresponding to the fall or no-fall event) results in the minimum difference with the test feature vector (i.e., the class decision is made by minimizing ϵf²).
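This classifier amounts to matching a test vector against the two class-average templates; a minimal sketch of Eq. (19) and the decision rule is shown below, with illustrative names and placeholder data.

```python
import numpy as np

def lsm_references(fall_vecs, nofall_vecs):
    """Average fall and no-fall feature vectors act as the two reference templates."""
    return np.mean(fall_vecs, axis=0), np.mean(nofall_vecs, axis=0)

def lsm_classify(t, ref_fall, ref_nofall):
    """Eq. (19): choose the class whose reference gives the smaller squared error."""
    e_fall = np.sum((t - ref_fall) ** 2)
    e_nofall = np.sum((t - ref_nofall) ** 2)
    return e_fall < e_nofall        # True -> fall, False -> no-fall

# Example with flattened feature vectors of length g
g = 13 * 537
ref_f, ref_n = lsm_references(np.random.randn(10, g), np.random.randn(10, g))
print(lsm_classify(np.random.randn(g), ref_f, ref_n))
```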

D. Artificial Neural Network (ANN)

An ANN is a set of independent processing units which receive inputs through weighted connections and can be used to solve specific problems [40]. Data classification can be obtained by training the network through a learning process using fall and no-fall audio features. The number of hidden neurons within the network affects the accuracy of the classifier, and theoretically estimating this number is rather difficult. In our case, several numbers of hidden neurons were tested. The highest accuracy was achieved when the number of hidden neurons was set to 10; increasing this number resulted in a gradual decrease of the accuracy, which might be due to over-fitting during the network training process. In addition, since there are two event classes (fall and no-fall), the output layer is composed of 2 neurons. If an input audio feature pattern belongs to a fall event, then the output of the corresponding neuron will be equal to one and the output of the other neuron will be equal to zero. Events are then classified at the output as

IF output1 ≥ output2 THEN fall, ELSE no-fall   (20)

Thus, a feedforward ANN with 10 hidden neurons and 2 output neurons is implemented using the Neural Network Toolbox in the MATLAB environment and trained using the scaled conjugate gradient method.
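The network itself is built with the MATLAB Neural Network Toolbox and trained with scaled conjugate gradient; as a rough stand-in only, the sketch below uses scikit-learn's MLPClassifier with a single hidden layer of 10 neurons. The different training algorithm, the use of scikit-learn and the placeholder data shapes are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train = np.random.randn(40, 13 * 537)             # placeholder flattened features
y_train = np.where(np.random.rand(40) > 0.5, 1, 0)  # 1 = fall, 0 = no-fall
X_test = np.random.randn(5, 13 * 537)

net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500)   # 10 hidden neurons
net.fit(X_train, y_train)
print(net.predict(X_test))                           # 1 -> fall, 0 -> no-fall
```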

IV. EXPERIMENTAL DATA AND ANALYSIS

A. Measurement data

Nine volunteers (2 female and 7 male) were recruited to perform and record different fall events in their home environments using their smartphones placed in their vicinity (within a distance of around 5 meters). Refer to Table I for details on the volunteers and the smartphone types used in the experiment. Young volunteers were used in the experiment, as it is difficult to recruit elderly volunteers to perform fall exercises due to the fear of injuries. Before the experiment, the volunteers were trained and instructed by a researcher to fall according to the fall characteristics discussed in [41]. Three fall classes were considered: fall from sleep, fall from sitting and fall from standing. The noise in the environment (such as TV, radio, talking, etc.) was not controlled during the experiment. Each volunteer performed five trials within each fall class (except one volunteer, who did not perform the five trials of fall from sleep) by falling onto a carpet, resulting in a total of 130 fall events, see Table II. The volunteers used their own carpets and fell onto floors of mainly wood and concrete types. For each trial, the phone was placed at a random location within the vicinity of the user. Sounds of other no-fall daily activities (a total of 130) were also recorded (by one individual) using a Samsung Galaxy S3 smartphone, see Table III. All audio signals (fall and no-fall) were recorded at a sampling frequency of 44.1 kHz with a duration of four seconds. In total, 130 fall and 130 no-fall events were recorded in home environments. Each sound signal is processed in MATLAB to extract the different audio features discussed in Section II. These features are further used for training and testing the different machine learning classifiers presented in Section III.


TABLE I
VOLUNTEER AND SMARTPHONE DETAILS

#   Gender   Age   Height (cm)   Weight (kg)   Phone type
1   Female   21    171           75            Samsung S5
2   Female   22    159           55            Samsung S3
3   Male     20    180           80            Samsung S5
4   Male     21    189           95            LG Optimus E975
5   Male     21    180           80            HTC One (M7)
6   Male     21    184           75            Samsung S4
7   Male     23    183           80            Samsung Note 2
8   Male     27    186           80            Samsung S3
9   Male     37    170           82            Samsung S3

TABLE II
FALL SOUND EVENT DATABASE

Activity             Number of trials
Fall from sleep      40
Fall from sitting    45
Fall from standing   45
Total fall events    130

B. Performance Metric Index

A reliable fall detection algorithm has to decide whether a fall has occurred based on distinct features of an audio signal. The following three success criteria can be used to measure its performance:
1) Sensitivity: the capacity of the system to detect falls. It is the ratio of true positives to the total number of falls,

Sensitivity = TP/(TP + FN) × 100   (21)

2) Specificity: the capacity of the system to detect falls only when they occur,

Specificity = TN/(TN + FP) × 100   (22)

3) Accuracy: the ability of the system to differentiate between falls and no-falls,

Accuracy = (TP + TN)/(TP + FN + FP + TN) × 100   (23)

where TP refers to a true positive (a fall occurs and the algorithm detects it), TN is a true negative (a fall does not occur and the algorithm does not report a fall), FP is a false positive (a fall does not occur but the algorithm reports a fall) and FN is a false negative (a fall occurs but the algorithm does not detect it). In general, the system must minimize the FPs and FNs. In particular, falls should not be misclassified as some other activity; thus, false negatives must be avoided at all times.
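The three criteria translate directly into code; the helper below, with illustrative names, evaluates Eqs. (21)-(23) from (possibly fold-averaged) TP, TN, FP and FN counts.

```python
def detection_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy in percent, Eqs. (21)-(23)."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

# Example with the fold-averaged counts of the ANN/spectrogram case in Table IV
print(detection_metrics(17.40, 21.10, 0.30, 0.20))
```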

TABLE III
NO-FALL SOUND DATABASE

Activity          # Trials    Activity              # Trials
Walking           5           Typing keyboard       5
Talking           5           Rubbing               5
Running           5           Cans rolling          5
Watching TV       5           Plastic rolling       5
Cooking food      5           Sitting on chair      5
Eating food       5           Sleeping on bed       5
Drinking water    5           Sleeping on a couch   5
Washing dishes    5           Walking on stairs     5
Phone ringing     5           Opening window        5
Books falling     5           Closing window        5
Cans falling      5           Shaking a key         5
Plastic falling   5           Door knocking         5
Ball falling      5           Microwave noise       5
Total no-fall events                                130

C. Results and discussions

The objective of the analysis is to investigate which audio feature and which classifier are suitable for detecting a fall using a smartphone audio signal. For each audio feature, the performance of each classifier in terms of sensitivity, specificity, accuracy and computational complexity is evaluated. From our experimental data there are 130 fall and 130 no-fall events, resulting in a total of 260 sound events, see Tables II and III. A 10-fold cross validation is employed, where the fall and no-fall sound events are randomly split into 10 equal-size partitions with 26 sound events randomly assigned to each group. Nine partitions are used for training the classifiers and the remaining partition is used for testing/validation purposes. The training and testing process is repeated 10 times so that each of the 10 groups is used exactly once as the testing data. The results are then averaged over the 10 validations to yield the average performance. The advantage of this method is that each sound event in the data-set gets a chance of being used as a test signal.

Tables IV to VII show the performance of the different classifiers using the spectrogram, MFCCs, LPC and MP audio features, respectively. For the spectrogram features, the best performance is achieved using the ANN classifier, followed by the SVM, k-NN and LSM classifiers, see Table IV. For the MFCCs and LPC features, the k-NN classifier gives the best performance, followed by the SVM, ANN and LSM classifiers, see Tables V and VI. For the MP features, the best performance is achieved using the ANN classifier, followed by the LSM, SVM and k-NN classifiers, see Table VII. For each audio feature, the computational requirements of the four machine learning techniques are also evaluated (see the last two rows of Tables IV to VII) in terms of the training and testing times required for a single fold of the data-set. The algorithms are implemented in MATLAB 8.2.0.701 (R2013b) on a Windows 7 computer with a 1.73 GHz quad-core 64-bit Intel Core i7 processor and 8 GB of RAM. In terms of the required training time, the classifiers can be sorted in increasing order as k-NN, LSM, SVM and ANN for the spectrogram, MFCCs, LPC and MP features. In terms of the testing time, the order is LSM, SVM, ANN and k-NN for the spectrogram and LPC features; LSM, ANN, SVM and k-NN for the MFCC features; and SVM, ANN, LSM and k-NN for the MP features.

The overall best performance is achieved using the spectrogram features with the ANN classifier, with sensitivity, specificity and accuracy all above 98%, see Table IV. The false negatives are also minimal, thus preventing falls from being misclassified as some other activity. The classifier also has acceptable computational requirements for training and testing. Figures 1 and 2 show examples of the spectrogram features of a fall and a no-fall (a person talking) sound event, respectively.
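The 10-fold protocol described above can be sketched as follows; the random seed and the plain (unstratified) split are assumptions, and the loop body is only a placeholder for training and scoring a classifier.

```python
import numpy as np

def ten_fold_indices(n_events=260, n_folds=10, seed=0):
    """Randomly split the event indices into 10 equal folds of 26 events."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_events), n_folds)

folds = ten_fold_indices()
for k, test_idx in enumerate(folds):
    train_idx = np.hstack([f for i, f in enumerate(folds) if i != k])
    # train on train_idx, count TP/TN/FP/FN on test_idx, then average over the 10 folds
    print(f"fold {k}: {len(train_idx)} training events, {len(test_idx)} test events")
```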

TABLE IV
PERFORMANCE RESULTS USING SPECTROGRAM FEATURES

Metric                 k-NN      SVM       LSM       ANN
TP                     17.30     17.10     14.50     17.40
TN                     20        20.40     18.30     21.10
FP                     1.40      1         3.10      0.30
FN                     0.30      0.50      3.10      0.20
Sensitivity            98.49%    97.46%    82.17%    98.97%
Specificity            93.55%    95.46%    85.71%    98.49%
Accuracy               95.64%    96.15%    84.10%    98.72%
Computation time (s)
  Training             0         0.427     0.006     0.989
  Testing              0.112     0.026     0.001     0.038

Fig. 1. Typical spectrogram features of a fall sound event.

TABLE V
PERFORMANCE RESULTS USING MFCCs FEATURES

Metric                 k-NN      SVM       LSM       ANN
TP                     14.10     13.90     0         9.70
TN                     18.20     16.90     19.40     17.60
FP                     1.20      2.50      0         1.80
FN                     5.50      5.70      19.6      9.90
Sensitivity            72.78%    71.66%    0%        49.26%
Specificity            93.42%    87.41%    100%      90.74%
Accuracy               82.82%    78.97%    49.74%    70%
Computation time (s)
  Training             0         0.288     0.034     27.40
  Testing              1.491     0.046     0.023     0.043

TABLE VI
PERFORMANCE RESULTS USING LPC FEATURES

Metric                 k-NN      SVM       LSM       ANN
TP                     11.40     11.50     0         5.40
TN                     19.30     18.70     20.10     18.10
FP                     0.80      1.40      0         2
FN                     7.50      7.40      18.90     13.50
Sensitivity            60.73%    61.23%    0%        28.03%
Specificity            96.21%    92.92%    100%      89.88%
Accuracy               78.72%    77.44%    51.53%    60.26%
Computation time (s)
  Training             0         0.449     0.097     105.27
  Testing              5.599     0.091     0.083     0.098

Fig. 2. Spectrogram features of a no-fall sound event (a person talking in the room).

TABLE VII
PERFORMANCE RESULTS USING MP FEATURES

Metric                 k-NN      SVM       LSM       ANN
TP                     11.80     10.20     20.60     18.6
TN                     8.40      14.90     8         17.7
FP                     10        3.50      10.40     0.70
FN                     8.80      10.4      0         2
Sensitivity            57.98%    49.53%    100%      89.74%
Specificity            46.22%    80.88%    43.39%    96.51%
Accuracy               51.79%    64.35%    73.33%    93.08%
Computation time (s)
  Training             0         32.59     13.41     112.17
  Testing              96.25     1.75      2.35      2.22


V. LIMITATIONS

There are some limitations to the smartphone based audio fall detection system presented above which need to be considered in future studies. Since young volunteers were used in the experiment, the fall characteristics may not be fully representative of actual falls by the elderly. There might also be other fall events where, e.g., the person slumps slowly to the ground with no apparent impact; these types of falls could not be detected by the audio-based system. Furthermore, blind source separation (BSS) techniques such as [42], [43] could be used to increase the performance of the system in noisy environments. The system could also be extended to include algorithms for fall recovery recognition, able to detect whether a person has recovered from a fall event or is still lying on the ground. Finally, the maximum distance of the subject from the smartphone is limited (around 5 meters) and the system may not work when the person is, e.g., in a different room. Thus, additional microphones installed in different rooms of the house and connected to the smartphone by wireless links might be considered.

VI. CONCLUSION

In this study, a fall detection system based on smartphone audio features is developed. The system is applicable in home environments where the phone is placed in the vicinity of the user. The spectrogram, MFCC, LPC and MP features of different fall and no-fall sound events are extracted from experimental data. Based on the extracted audio features, four different machine learning classifiers were investigated. For each audio feature, the performance of each classifier in terms of sensitivity, specificity, accuracy and computational complexity is evaluated. The best performance is achieved using spectrogram features with the ANN classifier, with sensitivity, specificity and accuracy all above 98%. The classifier also has acceptable computational requirements for training and testing. Future work includes a long-term assessment of the system performance in different home environments.

ACKNOWLEDGMENT

The author would like to thank Gjøvik University College for supporting this work and the volunteers for participating in the measurement campaign.

REFERENCES

[1] W. Ludwig, K.-H. Wolf, C. Duwenkamp, N. Gusew, N. Hellrung, M. Marschollek, M. Wagner, and R. Haux, “Health-enabling technologies for the elderly - an overview of services based on a literature review,” Computer Methods and Programs in Biomedicine, vol. 106, pp. 70–78, May 2012. [2] World Health Organization, “Global report on falls prevention in older age,” Available: http://www.who.int/ageing/publications/Falls prevention7March.pdf, 2007. [3] R. Igual, C. Medrano, and I. Plaza, “Challenges, issues and trends in fall detection systems,” BioMedical Engineering Online, vol. 12, no. 66, pp. 1–24, 2013. [4] L. Z. Rubenstein and K. R. Josephson, “The epidemiology of falls and syncope,” Clinics in Geriatric Medicine, vol. 18, no. 2, pp. 141–58, May 2002. [5] M. D. Tinetti, W. L. Liu, and E. B. Claus, “Predictors and prognosis of inability to get up after falls among elderly persons,” Journal of the American Medical Association, vol. 269, no. 1, pp. 65–70, May 1993. [6] S. M. Friedman, B. Munoz, S. K. West, S. G. Rubin, and L. P. Fried, “Falls and fear of falling: Which comes first? A longitudinal prediction model suggests strategies for primary and secondary prevention,” Journal of the American Geriatrics Society, vol. 50, pp. 1329–1335, 2002. [7] A. C. Sheffer, M. J. Schuurmans, N. van Dijk, T. van der Hooft, and S. E. Rooij, “Fear of falling: measurement strategy, prevalence, risk factors and consequences among older persons,” Age Ageing, vol. 37, pp. 19–24, 2008.

[8] Y. Li, K. C. Ho, and M. Popescu, “A microphone array system for automatic fall detection,” IEEE Transactions on Biomedical Engineering, vol. 59, no. 2, pp. 1291–1301, May 2012. [9] C. Rougier, J. Meunier, A. St-Arnaud, and J. Rousseau, “Robust video surveillance for fall detection based on human shape deformation,” IEEE Transaction on Circuits and Systems for Video Technology, vol. 21, no. 5, pp. 611–622, May 2011. [10] G. Mastorakis and D. Makris, “Fall detection system using Kinects infrared sensor,” Journal of Real-Time Image Processing, pp. 1–12, March 2012. [11] H. W. Tzeng, M. Y. Chen, and J. Y. Chen, “Design of fall detection system with floor pressure and infrared image,” In Proceedings of the International Conference on System Science and Engineering, pp. 131– 135, 1-3 July, Taipei 2010. [12] U. Lindemann, A. Hock, M. Stuber, W. Keck, and C. Becker, “Evaluation of a fall detector based on accelerometers: A pilot study,” Medical Biological Engineering Computing, vol. 43, no. 5, pp. 548– 551, September 2005. [13] T. Zhang, J. Wang, L. Xu, and P. Liu, “Fall detection by wearable sensor and one-class SVM algorithm,” In Lecture Notes in Control and Information Science, Springer, vol. 345, pp. 858–863, 2006. [14] C. Douckas, I. Maglogiannis, F. Tragkas, D. Liapis, and G. Yovanof, “Patient Fall Detection using Support Vector Machines,” International Federation for Information Processing, vol. 247, pp. 147–156, 2007. [15] J. Cheng, X. Chen, and M. Shen, “A framework for daily activity monitoring and fall detection based on surface electromyography and accelerometer signals,” IEEE Transactions on Biomedical and Health Informatics, vol. 17, no. 1, pp. 38–45, January 2013. [16] Q. Li, J. A. Stankovic, M. Hanson, A. Barth, and J. Lach, “Accurate, fast fall detection using gyroscopes and accelerometer derived posture information,” In Proceedings of the 6th International Workshop on Wearable and Implantable Body Sensor Networks, pp. 138–143, 3-5 June, Berkeley 2009. [17] F. Sposaro and G. Tyson, “iFall: an Android application for fall monitoring and response,” In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 6119–22, 3-5 September, Minneapolis 2009. [18] J. Dai, X. Bai, Z. Yang, Z. Shen, and D. Xuan, “Mobile phone-based pervasive fall detection,” Personal and Ubiquitous Computing, vol. 14, pp. 633–643., 2010. [19] I. C. Lopes, B. Vaidya, and J. Rodrigues, “Towards an autonomous fall detection and alerting system on a mobile and pervasive environment,” Telecommunications systems, vol. 48, pp. 1–12., 2011. [20] M. V. Albert, K. Kording, M. Herrmann, and A. Jayaraman, “Fall classification by machine learning using mobile phones,” PLOS One, vol. 7(5):e36556, May 2012. [21] R. Y. W. Lee and A. J. Carlisle, “Detection of falls using accelerometers and mobile phone technology,” Age Ageing, vol. 0, pp. 1–7, 2011. [22] S. H. Fang, Y. C. Liang, and K. M. Chiu, “Developing a mobile phonebased fall detection system on android platform,” In Proceedings of the Conference on Computing, Communications and Applications, pp. 143– 146, 11-13 January, Hong Kong 2012. [23] S. Abbate, M. Avvenuti, F. Bonatesta, G. Cola, P. Corsini, and A. Vecchio, “A smartphone-based fall detection system,” Pervasive and Mobile Computing, vol. 8, pp. 883–899, December 2012. [24] P. Khunarsal, C. Lursinsap, and T. Raicharoen, “Very short time environmental sound classification based on spectrogram pattern matching,” Information Sciences, vol. 243, pp. 
57–74, September 2013. [25] Thiang and S. Wijoyo, “Speech recognition using linear predictive coding and artificial neural network for controlling movement of mobile robot,” International Conference on Information and Electronics Engineering (IPCSIT), vol. 6, 28-29 May, Bangkok 2011. [26] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transaction Acoustics Speech and Signal Processing, vol. 28, pp. 375–366, August 1980. [27] W. Yutai, L. Bo, J. Xiaoqing, and L. Feng, “Speaker recognition based on dynamic MFCC parameters,” International Conference on Image Analysis and Signal Processing (IASP), pp. 406–409, 11-12 April, Taizhou 2009. [28] O. O’Shaughnessy, “Speech Communication: Human and Machine,” Reading, MA: Addison-Wesley, p. 1210214, 1987. [29] M. Hariharan, L. S. Chee, O. C. Ai, and Y. Sazali, “Classification of speech dysfluencies using LPC based parametrization techniques,” Journal of Medical Systems, vol. 6, pp. 1821–1830, 2012. [30] L. Rabiner and B. H. Juang, Fundamentals of speech recognition. Prentice Hall, New Jersey, USA, 1993.


[31] M. H. Hayes, Statistical digital signal processing and modeling. John Wiley Sons, Inc., New York, NY, USA, 1996. [32] J. D. Markel and A. H. Gray, Linear prediction of speech. SpringerVerlag, 1976. [33] S. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transaction on Signal Processing, vol. 41, no. 12, pp. 3397–3415, December 1993. [34] R. Negg and A. Zakhor, “Very low bit rate video coding based on matching pursuits,” IEEE Transaction on Circuits and Systems for Video Technology, vol. 7, no. 1, pp. 158–171, February 1997. [35] R. Gribonval and E. Bacry, “Harmonic decomposition of audio signals with matching pursuit,” IEEE Transaction on Signal Processing, vol. 51, no. 1, pp. 101–111, January 2003. [36] S. Chu, S. Narayanan, and C. C. J. Kuo, “Environmental sound recognition with Time-Frequency Audio Features,” IEEE Transaction on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1142–1158, August 2009. [37] R. O. Duda, P. E. Hart, and D. G. Stok, Pattern Classification. 2nd

Edition, John Wiley Sons, Inc., New York, NY, USA, 2001. [38] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, vol. 20, pp. 273–297, 1995. [39] V. Jakkula, “Tutorial on support vector machine (SVM),” School of EECS, Washington State University, 2006. [40] S. Haykin, Neural Networks and Learning Machines. Prentice Hall, Upper Saddle River, New Jersy, NJ, USA, 2009. [41] X. Yu, “Approaches and principles of fall detection for elderly and patient,” 10th International Conference on e-health Networking, Applications and Services, pp. 42–47, 7-9 July, Singapore 2008. [42] K. Rahbar and J. P. Reilly, “A frequency domain method for blind source separation of convolutive audio mixtures,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 832–844, September 2005. [43] J. Liu, J. Xin, Y. Qi, and F. Zeng, “A time domain algorithm for blind separation of convolutive sound mixtures and L1 constrained minimization of cross correlations,” Communications in Mathematical Sciences, vol. 7, no. 1, pp. 109–128, 2009.
