IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 11, NOVEMBER 2013


Neural Network Approaches for Noisy Language Modeling

Jun Li, Karim Ouazzane, Hassan B. Kazemian, Senior Member, IEEE, and Muhammad Sajid Afzal

Abstract— Text entered by people is not only grammatical and distinct, but also noisy. For example, a user’s typing stream contains all the information about the user’s interaction with a computer through a QWERTY keyboard, which may include the user’s typing mistakes as well as specific vocabulary, typing habits, and typing performance. These features are particularly evident in disabled users’ typing streams. This paper proposes a new concept called noisy language modeling by further developing information theory, and applies neural networks to one of its specific applications, the typing stream. It uses a neural network approach to analyze disabled users’ typing streams, both in general and in specific ways, to identify their typing behaviors and subsequently to make typing predictions and typing corrections. A focused time-delay neural network (FTDNN) language model, a time gap model, a prediction model based on time gap, and a probabilistic neural network (PNN) model are developed. A 38% first hitting rate (HR) and a 53% first three HR in symbol prediction are obtained from the analysis of a user’s typing history through FTDNN language modeling. The results of the time gap prediction model show correction rates lying predominantly between 65% and 90% on the current testing samples, and the results of the PNN model show 70% of all test scores above the basic correction rates. The modeling process demonstrates that a neural network is a suitable and robust language modeling tool for analyzing noisy language streams. The research also paves the way for practical application development in areas such as informational analysis, text prediction, and error correction by providing a theoretical basis of neural network approaches for noisy language modeling.
Index Terms— Backpropagation, correction, first three hitting rate, focused time-delay, n-gram, prediction, probabilistic neural network, time gap, typing stream.

I. INTRODUCTION

The goal of statistical language modeling (SLM) [49] is to build a statistical language model that can estimate the distribution of natural language as accurately as possible. A language model assigns probabilities to sequences of symbols or words, and is used in many natural language processing

Manuscript received February 11, 2012; accepted May 11, 2013. Date of publication July 5, 2013; date of current version October 15, 2013. This work was supported by Disability Essex and Technology Strategy Board, U.K. J. Li is with the Gray Institute for Radiation Oncology and Biology, University of Oxford, Oxford OX3 7DQ, U.K. (e-mail: [email protected]). K. Ouazzane, H. B. Kazemian, and M. S. Afzal are with the Faculty of Computing, London Metropolitan University, London N7 8DB, U.K. (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2013.2263557

applications such as speech recognition and data prediction. Information theory [7], [29] is one of its branches, used to quantify information. Informational data is, however, usually noisy. With regard to a typing stream, noise refers to the typing mistakes and correction actions that appear in the stream. A user’s typing stream generated with a computer QWERTY keyboard [1] reflects the user’s typing behavior, including typing mistakes, vocabulary, typing habits, and typing performance. Computer users inevitably make errors, and their typing streams contain all of their self-rectification actions. For example, research shows that disabled keyboard users perform in various ways and make various mistakes such as long key presses, adjacent key presses, and so forth [24]. Here is an example:

I’mmmmmm z sutdent
I’m a student.

The first line is a disabled user’s typing stream shown in Notepad, whereas the second is the correct reference. This short typing stream includes a long key press error, a misstroke error, and a dyslectic typing mistake, which can be seen as noise added to a clean text. All of these can be corrected or filtered either through a specific analysis of each type of error, or through a general analysis of historical data without considering any particular error type. In the language modeling area, a considerable amount of research has been carried out, such as N-gram prediction and prediction by partial matching [6], which are applied efficiently to clean text, but their use on active text with noisy data is hardly found. Therefore, this paper puts forward a new concept called noisy language modeling, which is used to estimate the probabilities of a set of symbols or words based on noisy historical data rather than clean text. A neural network is a statistical analysis tool that can learn from a collection of examples to discover patterns and trends, while the goal of SLM is to build a language model for estimating the distribution of natural language as accurately as possible [49]. Neural network language modeling uses particular neural network models to capture the salient statistical characteristics of the distribution of sequences of text in a natural language, typically allowing probabilistic predictions of a length of text given the preceding text. Current research in this area falls mainly into two categories: prediction based on intelligent algorithms or statistics, and prediction based on natural language

2162-237X © 2013 IEEE


processing related to syntactic and semantic analysis [12]. For example, [6] used a statistical method called partial string matching, based on Shannon’s information theory [28], to show that mixed-case English text can be coded in as little as 2.2 b/character with no prior knowledge of the source. The statistical methodology was further demonstrated in [8] by developing an interface (i.e., [45]) incorporating language modeling. Xu and Rudnicky [41] discussed the most common and widely used approach to SLM (i.e., N-gram models) and suggested that a neural network can learn a language model and achieve even better performance. Thereafter, some research was carried out based on neural networks. Holger and Jean [26] used a neural network model to predict words based on learning from large corpora, Mikolov et al. [22] presented a new recurrent neural network-based language model with applications to speech recognition, and [20] prioritized the word lists generated for text prediction based on an integrated neural network model. Language streams such as a typing stream are, however, usually noisy, as illustrated above. Although some research related to noise has been carried out in the language modeling field using statistics and natural language processing (e.g., spell checker applications), neither the statistical nor the neural network models have sufficiently considered noise tolerance in the analysis. This echoes the claim raised by Xu and Rudnicky [41] with just one word, noisy, added: a strange phenomenon is that, in spite of the popularity of artificial neural networks, it is hard to find any work on noisy language modeling using neural networks in the literature. In addition, most of the pertinent work focuses solely on correcting one type of problem while ignoring others [47]. Very little work has attempted to analyze noisy language streams as a whole based on their distinct features.
In addition, applications such as spell checkers are not designed specifically for complex errors such as those made by disabled users, which are mostly not spelling mistakes. In particular, with regard to user typing stream analysis, some applications lack self-adaptive (i.e., learning) ability and fail to fully recognize the right patterns from a user’s distinct performance [30], [31], [34]. Thus, there is a need to develop models with learning ability to analyze noisy user typing streams. The time-delay neural network (TDNN), which employs time delays on connections in feedforward networks, was introduced in [38], and an adaptive version of the TDNN was then proposed in [9]. The TDNN has been successfully applied in many areas such as speech recognition [16], temperature prediction [32], and nonlinear system identification [42]. Although it was suggested in [2] that TDNNs could have a better effect in language modeling, little research is found in this domain. The probabilistic neural network (PNN), derived from Bayes decision networks, is a type of radial basis network suitable for classification problems. It has been used in various areas such as core classifiers for medium-scale speaker recognition [37], computer-based face detection systems [19], and game playing [11], which demonstrated its distinct characteristics such as incremental training, robustness to noisy examples, and learning capacity [39]. It has rarely been used in SLM; [43] appears to have used it to fight the curse of dimensionality by learning a distributed representation for words. Although its

capacity was fully demonstrated, unfortunately noise analysis was beyond the goal of [43].

In general, the aim of this paper is mainly threefold: 1) to introduce a new concept, i.e., noisy language modeling, in the language modeling field; 2) to provide pilot modeling approaches by applying distinct neural network models to this new field, while testing the capacity and robustness of the neural networks; and 3) in practice, to give an alternative solution for text prediction and error correction. In this demonstration, distinct neural networks including the focused time-delay neural network (FTDNN) and the PNN are applied as statistical learning tools to process one scenario of noisy language streams, namely typing streams, to identify the patterns that a particular person holds, and to provide users with some distinct functions.

The rest of this paper is organized as follows. Section II-A describes the experimental data sets used by the models. Section II-B describes an extendible FTDNN language model applied to noise-free, noisy, and typing stream data sets. In Section II-C, based on a study of the influence of time gap on a user’s typing performance, a prediction using time gap (PTG) model is developed to correct wrong symbols. In Section II-D, a model based on the PNN is developed to simulate a specific user’s typing behavior. Section II-E summarizes the models and their performances. Finally, Section III presents the conclusion and suggestions for future work.

II. NEURAL NETWORK MODELS DEVELOPMENT

A. Experimental Data Sources

The main experimental data sets used in this paper are described as follows.

1) Data Set 1: A novel, Far from the Madding Crowd [15] by Thomas Hardy (1874). The version used here is extracted from the Calgary Corpus with a size of 751 KB in text format [3]. A sample is shown as follows:

I don’t think it is for you, sir, said the man, when he saw Boldwood’s action.
Though there is no name I think it is for your shepherd.

2) Data Set 2: It is extracted from Disability Essex (U.K.) helpline keystroke logs [44]. The associated computer is used by a disabled volunteer as a question recording, database retrieval, and email composition tool. The keystroke recording tool used in this research is the KeyCapture software [33], which was modified and adjusted for this paper. An example is shown as follows:

[Keystroke log sample with columns numbered 1 to 9]

Columns 3–6, 8, and 9 refer to action date and time (ms), keys pressed, keys status (up or down), the value of virtual key code, distance between two keys of a standard keyboard, and time gap between two consecutive key presses, respectively.

LI et al.: NEURAL NETWORK APPROACHES FOR NOISY LANGUAGE MODELING

3) Data Set 3: Typing samples from some people with Parkinson disease or motor disability are also gathered. The segment used by the intelligent models is as follows:

ORIGIN/TARGET − the quick brown fox jumped over the lazy dog
TYPED − hthe quick brrooownn fgow jummppefd iobverethe lwqazy dooggfg

ORIGIN/TARGET refers to a typing reference, also called the target data set, and TYPED refers to a user’s typing sample shown in the Notepad editor. These are recorded in plain text format. The associated keystrokes are also tracked and saved by KeyCapture.

Fig. 1. Presentation of FTDNN n-gram language modeling process. (Diagram labels: text sequence; 0-gram, 1-gram, ..., n-gram; time delays 0-TD, 1-TD, ..., n-TD at the input; FTDNN; outputs 1, 2, ..., m.)

B. FTDNN Language Modeling

1) FTDNN and n-Gram Prediction: The FTDNN is suitable for time-series prediction. Studying a user’s typing behavior requires the network to study the user’s history and trace back a certain length of context, the so-called n-gram [5], to predict the next probable occurrence of symbols. It was demonstrated in [17] that the FTDNN is very reliable with respect to time and memory requirements. Comprehensive research on FTDNN language modeling with noisy data streams is, however, rarely found. N-gram prediction can be achieved by adjusting the time delays of the FTDNN model, and correction can be achieved in the same way by treating correction as a type of prediction that produces the right symbol based on the noisy history. Here, an extendable FTDNN n-gram prediction model, which combines the FTDNN and the n-gram prediction method, is developed to predict noise-free, noisy, and typing stream data sets.

The FTDNN n-gram prediction [21] is defined as follows. Assume a string $S = \{s_1 \ldots s_i \ldots s_j s_{j+1} \ldots s_k \ldots s_p \mid 1 \le i \le j < j+1 \le k \le p\}$ with $(j - i + 1) = n$ and $(k - j) = l$, where $s_1, s_i, s_j, s_{j+1}, s_k, s_p$ are symbols and $i, j, k, l, n, p$ are natural numbers. If one builds a relation $R_n = \{x, y \mid x = (s_i \ldots s_j)_n \rightarrow y = (s_{j+1} \ldots s_k)_l\}$, then the relation is defined as the n-gram’s l-prediction; in the special case $l = 1$, the relation is called the n-gram’s one-prediction, or n-gram prediction for short. For example, given the string S = {student}, some 2-gram prediction cases are

st → u
tu → d
…
en → t.

In this paper, only the n-gram’s one-prediction is used, and the model built on it with the FTDNN is called the FTDNN n-gram model.

To evaluate the experimental results, some concepts are introduced here. The first hitting rate (HR) and first three (FT) HR refer to the probability of the expected symbol falling into the top rank and the top three ranks, respectively, of the outputs sorted in descending order, while in general the probability of falling into rank x is called the level x HR.
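As an illustration of the n-gram one-prediction relation defined above, the following Python sketch enumerates the (context, next symbol) pairs of a string. The function name `ngram_one_predictions` is hypothetical and is not part of the original MATLAB implementation; the sketch only mirrors the definition with l = 1.

```python
from typing import List, Tuple


def ngram_one_predictions(s: str, n: int) -> List[Tuple[str, str]]:
    """Return (context, next_symbol) pairs for the n-gram one-prediction.

    Each pair maps n consecutive symbols to the single symbol that follows
    them, i.e., the case l = 1 of the relation R_n.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [(s[i:i + n], s[i + n]) for i in range(len(s) - n)]


pairs = ngram_one_predictions("student", 2)
for context, target in pairs:
    print(f"{context} -> {target}")
```

For the string student and n = 2, the pairs start with st → u and end with en → t, matching the cases listed in the text.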
Fig. 1 shows the defined FTDNN n-gram model with its language modeling logic and the relation between the n-gram, the FTDNN, and the level x HR. Studying a user’s typing behavior requires the network to study the user’s history and trace back a certain length of context (i.e., n-gram) to predict the next probable occurrence. Here, n time delays correspond to an n-gram, and adding one more gram requires one more time delay. Variable m represents the size of the language symbol set as well as the number of output neurons, while a level x HR refers to rank x, and its set related to the symbol set is generated in a testing iteration. For instance, assume a symbol set {a, b, c, d, e} that corresponds to the first to fifth output neurons, an input string abc, an expected output string ace, and three consecutive outputs shown as probabilities:

a     b     c     d     e
0.05  0.78  0.20  0.10  0.01
0.05  0.08  0.30  0.10  0.01
0.10  0.02  0.02  0.35  0.03

Then one has m = 5. To calculate the level x HR, one constitutes the unary code by converting the xth biggest value among each output set to one and the rest to zeros. In particular, the level 1 HR (i.e., the first HR) follows the winner-takes-all rule: the neuron with the biggest output value is set to one while the others are set to zeros. The unary codes of the example are then

a  0 0 0
b  1 0 0
c  0 1 0
d  0 0 1
e  0 0 0.

That is, the predicted string is bcd. For the calculation of the level 2 HR, the unary codes are converted into

a  0 0 1
b  0 0 0
c  1 0 0
d  0 1 0
e  0 0 0.

That is, the predicted string is cda. A level x HR is calculated by comparing the predicted string with the expected output string and dividing the number of matches by the length of the string. The first HR of the example is then

$$\operatorname{compare}(\text{'bcd'} - \text{'ace'}) / \operatorname{length}(\text{'bcd'}) = 33.33\% \tag{1}$$

where the compare function returns the number of identical symbols between two strings and the length function returns the length of the string. In the same way, the level 2 HR of the example is zero. In general, the level x HR set is

$$\mathrm{HR} = \{hr_x \mid hr_x = \operatorname{compare}(E - P_x) / \operatorname{length}(E),\ 1 \le x \le m\} \tag{2}$$

where $hr_x$, $E$, and $P_x$ refer to the level x HR, the expected result, and the predicted result of the level x HR, respectively. For the FT HR, which is the sum of the top three ranks, the equation is

$$\mathrm{FT\,HR} = \sum_{i=1}^{3} hr_i. \tag{3}$$
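The level x HR calculation in (1)–(3) can be sketched in Python using the five-symbol example above. This is a minimal sketch, not the authors’ MATLAB code: the function names are hypothetical, compare is assumed to count position-wise matches, and ties between equal outputs are assumed to be broken in symbol-set order.

```python
from typing import List, Sequence


def level_prediction(outputs: Sequence[Sequence[float]], symbols: str, x: int) -> str:
    """Build the level x predicted string: the x-th highest output of each row."""
    predicted = []
    for row in outputs:
        # Rank neuron indices by descending output value. Python's sort is
        # stable, so ties keep symbol-set order (an assumption of this sketch).
        ranked = sorted(range(len(row)), key=lambda i: -row[i])
        predicted.append(symbols[ranked[x - 1]])
    return "".join(predicted)


def compare(expected: str, predicted: str) -> int:
    """Count the positions at which both strings hold the same symbol."""
    return sum(e == p for e, p in zip(expected, predicted))


def hitting_rates(outputs: Sequence[Sequence[float]], symbols: str, expected: str) -> List[float]:
    """Return the HR set of (2): level 1 to level m hitting rates."""
    return [
        compare(expected, level_prediction(outputs, symbols, x)) / len(expected)
        for x in range(1, len(symbols) + 1)
    ]


symbols = "abcde"
outputs = [
    [0.05, 0.78, 0.20, 0.10, 0.01],
    [0.05, 0.08, 0.30, 0.10, 0.01],
    [0.10, 0.02, 0.02, 0.35, 0.03],
]
expected = "ace"

hr = hitting_rates(outputs, symbols, expected)
ft_hr = sum(hr[:3])  # first three (FT) HR as in (3)
print(level_prediction(outputs, symbols, 1), level_prediction(outputs, symbols, 2))
print(hr[0], hr[1], ft_hr)
```

With these inputs the level 1 and level 2 predicted strings are bcd and cda, and the first HR is one third, as in the worked example.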

Apparently the sum of the HR set is 100%.

In practice, the model uses a 27−n−27 three-layer structure with 27 input neurons, 27 output neurons, an extendible number of hidden layer neurons, and an extendible number of time delays at the input to perform one-step-ahead predictions. A symbol set with 27 elements, A = {a . . . z, space}, is applied to simplify data set one. All capitals are converted to lowercase, and all other symbols not in set A are subsequently converted to spaces. During the modeling, a chunk of data ranging from zero to 100 k is selected from data set one. The data are subsequently divided into training, validation, and testing data. Both input and output are encoded in unary code [13]. In this paper, all experiments are carried out on a Lenovo T60 (IBM) platform (Intel Core2 CPU T5600 @ 1.83 GHz, 3.00 GB of RAM, 120 GB hard disk), with the Windows XP Professional 2002 operating system and MATLAB Version 7.4.0 with its neural network toolbox.

During the FTDNN model training and testing using data set one, the numbers of grams [1, 2, 3, 5, 7, 9, 11, 13], which are represented by time delays, and the numbers of hidden neurons [1, 2, 3, 5, 7, 9, 15, 25, 50, 100] are cross designed and implemented. When the gram reaches 11 and the number of hidden neurons reaches 100, or when the gram reaches 13 and the number of hidden neurons reaches 15 or more, the memory requirement exceeds the limit of the current system. Therefore, the experimental results are abandoned from G−11 and H−100 onward. As shown in Fig. 2, both plots have [1, 2, 3, 5, 7, 9, 11] grams associated with various numbers of hidden neurons. It is evident that the 2-, 3-, and 5-grams give the best three first HRs, while the 1-, 2-, and 3-grams give the best three FT HRs. (Note: by a small margin, the 2-gram gives the best FT HR and the 3-gram gives the best first HR.) In both plots of Fig. 2, the lower grams (1, 2, and 3) show a better convergence toward the maximal HRs (i.e., 3-gram’s first HR is around 37% and FT HR is around 58%). Both plots show that lower HRs occur from 4-gram onward. This indicates that the more historical data is input, the more neural network learning space is needed, and the more training is needed toward convergence. Under the current training sample, the results suggest that there is a best gram with a certain number of hidden units that suits the prediction best. Beyond a critical point of prediction rate, a further increase of gram or hidden units does not help to achieve a better performance. The results also indicate that the number of neurons in the hidden layer affects the model’s learning ability and HR. For instance, the number of neurons in the hidden layer should not be too small for a structured symbol set {a, . . . , z, space} distribution; otherwise, it would be difficult for the neural network to reach a good HR. The black curve displayed in Fig. 2 shows that the 11-gram testing stops at fifty neurons; a 27−100−27 three-layer FTDNN model failed to complete the training process under the current system environment because it ran out of memory. The hierarchical HRs generated by this model can be used by prediction ranking functions.

Fig. 2. N-gram first and FT HR curves. (Two panels, first hitting rate with n-gram and first three hitting rate with n-gram, each plotted against the number of hidden neurons from 0 to 100 for 1-, 2-, 3-, 5-, 7-, 9-, and 11-gram.)

2) N-Gram Prediction With Noises: To further test the FTDNN n-gram model’s prediction ability, a noise randomization method within the data preprocessing function is designed and applied to the training, validation, and testing data sets of data set one. The noise is randomly distributed into data set one after the data conversion into unary code. Fig. 3 shows the variation of the first HR (in blue) and the FT HR (in red). In the last experiment, the 2- and 3-grams obtain the best two HRs with fifty hidden neurons, so the FTDNN 2- and 3-gram models are used here. Fig. 3 also shows the prediction curves as the noise rate increases from 0 to 0.3.
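The preprocessing of data set one and the noise randomization can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the original MATLAB preprocessing function: the names `normalize`, `unary_code`, and `add_noise` are hypothetical, noise is applied to the symbol string rather than to the unary codes (the two are equivalent for a one-hot code), the noise rate is taken as the fraction of symbols altered, and a noisy symbol is assumed always to differ from the symbol it replaces.

```python
import random
import string
from typing import List

SYMBOLS = string.ascii_lowercase + " "  # symbol set A = {a..z, space}, 27 elements


def normalize(text: str) -> str:
    """Lowercase the text and map every symbol outside A to a space."""
    return "".join(ch if ch in SYMBOLS else " " for ch in text.lower())


def unary_code(symbol: str) -> List[int]:
    """Encode one symbol of A as a 27-element unary (one-hot) vector."""
    code = [0] * len(SYMBOLS)
    code[SYMBOLS.index(symbol)] = 1
    return code


def add_noise(text: str, noise_rate: float, rng: random.Random) -> str:
    """Replace round(noise_rate * len(text)) randomly located symbols.

    Each selected position receives a different symbol drawn from A
    (assumption: a noisy symbol never equals the original one).
    """
    chars = list(text)
    count = round(noise_rate * len(chars))
    for pos in rng.sample(range(len(chars)), count):
        alternatives = [s for s in SYMBOLS if s != chars[pos]]
        chars[pos] = rng.choice(alternatives)
    return "".join(chars)


clean = normalize("I don't think it is for you, sir")
noisy = add_noise(clean, 0.3, random.Random(0))
changed = sum(a != b for a, b in zip(clean, noisy))
print(clean)
print(noisy)
print(changed / len(clean))  # realized noise rate
```

Seeding the random generator makes a noisy data set reproducible, which is convenient when the same corrupted stream has to be shared by the training, validation, and testing splits.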

The noise rate is calculated as NoiseRate = c/s, where c and s refer to the number of noisy symbols and the length of the input string, respectively. Noise is applied by randomly locating symbols of the input string and assigning each of them another random value within the range of the symbol set A. Both plots show a decreased HR as the noise rate increases. For example, when the noise rate reaches ∼30%, the corresponding first HR is only 27%, compared with the prediction rate of 37% without noise. Fig. 3 shows that the maximal FT HR (58%, in 2-gram) occurs when the noise rate is zero, whereas the minimal FT HR (45%, in 3-gram) occurs at the rate of 0.3, which is the upper boundary of the noise rate. In general, the prediction rates decrease linearly as the noise rate increases. Although there is undulation, there is no evidence that the models are more suitable for streams with particular noise rates.

Fig. 3. 2- and 3-gram HR curves under noise rates. (Two panels, 2-gram with fifty neurons prediction and 3-gram with fifty neurons prediction, each plotting the 1-hitting rate and the total hitting rate against the noise rate.)

3) N-Gram Prediction With Typing Data: As analyzed, the designed FTDNN n-gram models have shown that they can be applied to noisy data prediction with a high capability. Here, a user’s typing data stream, data set two, is applied to the models. The user’s typing history is analyzed by the model to predict the user’s next typing intention. As the typing data stream is a typical noisy data set that includes the user’s typing mistakes as well as self-correction strokes such as backspace and delete, the model learns not only the user’s language habits but also the self-correction actions that occur in the typing stream. For example, a self-correction action from a wrongly typed word desj to the expected word desk can be broken down in a typing data stream as follows:

d => e => s => j => backspace => k …

This is a typical adjacent key press error usually made by some people with motor disability or Parkinson disease. Through training, the FTDNN model is able to learn 2-gram prediction rules between the predecessor and successor, for instance

d → e
e → s.

From the typing stream shown, the model learns not only the existing noise such as s → j, but also the correction actions such as j → backspace. In practice, users only need to continue typing without stopping in spite of mistakes, while the model should be able to correct the mistakes automatically or make recommendations later on through the predicted actions. The collected data stream in data set two is expressed in virtual key codes [48]. In this research, only editing virtual keys are adopted; other keys such as arrows are discarded. The symbol set originally used by the model is thus extended to 53 individual symbols, which apart from the letters of the alphabet also include other symbols such as

VK_BACK => BACKSPACE key
VK_RETURN => ENTER key
VK_SHIFT => SHIFT key
VK_DELETE => DEL key.

With the original design of the FTDNN n-gram model, an extension to 53 units at both the input and output layers is made, whereas the other parts of the structure are unchanged. Data set two records both the key-down and key-up status. Considering specific typing behaviors of some disabled people, such as prolonged key presses that generate several key-down events for one key-up event, the keystrokes with down status are selected by the preprocessing function for neural network training and testing. A comparison among the gram set [1, 3, 5, 7, 9] based on various numbers of hidden neurons [3, 5, 7, 9, 15, 25, 50, 100] is shown in Fig. 4. The first plot shows a comparison of the grams’ first HRs with an increase of hidden neurons. The second plot is a comparison of FT HRs between different grams. Fig. 4 shows that the 1-gram produces the maximal FT HR of 53%, whereas the 3-gram with fifty hidden neurons produces the maximal first HR of 38.1%. Similar results are obtained using data set one; the lower grams (1-, 2-, and 3-gram) show a

better solution using FTDNN model prediction under the current circumstances. Both data sets demonstrate a high prediction rate (FT HR, ∼50%) with the FTDNN model.

Fig. 4. [1, 3, 5, 7, 9] gram typing stream HRs. (Two panels, first hitting rate and first three hitting rate aggregation, each plotted against the number of hidden neurons from 0 to 100.)

4) FTDNN n-Gram Modeling Summary: This paper develops an FTDNN n-gram model with extendible numbers of hidden layer neurons and time delays to analyze noise-free, noisy, and users’ historical typing data. Approximately 50% FT HR is obtained from the experimental results. The model captures a user’s typing patterns through evolutionary learning. The experimental results can be used to predict users’ typing intention. In practice, a higher prediction rate could be achieved by combining the FT HR with an English word dictionary. As the typing stream includes all of the users’ correction actions and the predicted next symbol could be delete or backspace, the experimental results can also be used to correct users’ typing mistakes. Tests with both data sets one and two show that a minimal number of hidden neurons is required to get a good HR, but the testing also demonstrates uncertainty about which gram produces the best FT HR (e.g., 2-gram in Fig. 2 and 1-gram in Fig. 4). Therefore, a combination of 1-, 2-, and 3-grams could potentially be designed for an optimal solution.

C. Prediction Using Time Gap

1) Time Gap Modeling: According to [10], users’ input performance, represented by the variable IP in bits/s, is proportional to the variable ID, i.e., movement time, which has a direct relation with the moving distance from one point to another. On a standard keyboard layout, the time gap between two consecutive strokes directly depends on the distance between the two keys. As observed, the last key’s position, represented by its distance and angle to the target key, could affect some disabled users’ judgment of their typing accuracy and speed, which is reflected by the time gap recorded in the computer log. Given the user’s typing history, a neural network model named the time gap neural network (TGNN) is designed here to simulate and predict the time gap between two consecutive typed letters, using data set two as its experimental data set. A set of 54 virtual key codes is considered, which includes all 53 symbols used in the previous experiment, e.g., letters and space. The other symbols that appear in data set two but do not belong to the 53-symbol set are classified as a designed symbol, other. The data preprocessing extracts only the keystrokes whose time gaps are in the range [0, 3000] ms, while the rest, which are considered either out of range or related to computer system problems, are ignored. A 2-gram data set is created with the corresponding time gaps. This requires 108 (number of symbols × gram) neurons in the input layer. All the time gap values are normalized into the range [−1, 1] according to the min–max normalization [14]

$$v' = (v - V_{\min}) \frac{V'_{\max} - V'_{\min}}{V_{\max} - V_{\min}} + V'_{\min} \tag{4}$$

where $V'_{\max} = 1$, $V'_{\min} = -1$, and variable $v$ is the time gap value extracted from data set two. The results of the TGNN model are converted back to their natural values based on the same equation. As the backpropagation algorithm has the advantage of a relatively simple implementation, the data set has no requirement on the time dimension, and the scale of the data set is relatively small, the traditional backpropagation neural network [35] is chosen for this research. The algorithm is generally regarded as slow and inefficient in some research areas, but this has no detrimental effect on this test. The model was designed with a 108−7−1 three-layer structure using the Levenberg–Marquardt optimization algorithm, following a prior experimental and empirical selection of the numbers of middle layers and neurons, while further considering simplicity and performance. The input includes two consecutive symbols represented by unary codes, and the output is the expected time gap between these two consecutive symbols. The tangent sigmoid and linear transfer functions are used as the activation functions of the hidden and output layers, respectively. A reconstructed data set extracted from data set two is used as the neural network’s training and validation data set. Two other data sets, abcdefghijklmnopqrstuvwxyz in alphabetical order and qwertyuiopasdfghjklzxcvbnm in QWERTY keyboard layout order, are used as two testing cases. The modeling results based on the two data sets are shown in Figs. 5 and 6. The TGNN model is trained based on data set two. Then the alphabetical and QWERTY sequences are applied to the model.

Fig. 5. Modeling time gap using A → Z sequence. (Plot title: The time gap of typing A-Z; x-axis: alphabet, a to z; y-axis: time gap in ms.)

Fig. 6. Modeling time gap using QWERTY sequence. (Plot title: The time gap of Qwerty sequence; x-axis: Qwerty layout, q to m; y-axis: time gap in ms.)

Fig. 5 shows a simulation of the user’s typing behavior (e.g., speed and time gap) should the user type an alphabetical sequence; Fig. 6 shows a simulation of the user’s typing behavior (e.g., speed and time gap) should the user type a QWERTY sequence. Because there are no predecessors, the time gaps of the first keystrokes in both sequences (a in Fig. 5 and q in Fig. 6) are counted as zero. In Figs. 5 and 6, the x-axis represents the user’s typing sequence and the y-axis represents the time gap in milliseconds. Between each two consecutive letters, a blue line is drawn to show the elapsed time. The maximal time gap (637.4 ms) occurs in Fig. 5 when the finger moves from key x to y, while the minimal time gap (89.9 ms) appears in both figures, when the finger moves from j to k. These two figures show that the current keystroke’s predecessor does affect the user’s typing behavior (e.g., time gap) if one ignores the user’s keystroke action itself and the behavioral randomness that humans may have. Because the distance between each two keys on a computer QWERTY keyboard differs, the time gap between each two consecutive keys varies as the user types. The red lines in Figs. 5 and 6 represent the average time cost of all 25 movements, which show that the cost of typing an alphabetical order sequence is 384.44 ms (see Fig. 5), whereas the cost of typing a QWERTY order sequence is 342.50 ms (see Fig. 6). The test shows that typing an alphabetical sequence is more time consuming with a standard keyboard. This can be explained by movement cost, meaning that an alphabetical order sequence requires more time for the user to move from one key to another. This paper explores the idea that the time gap between two consecutive keystrokes is influenced by the current symbol’s predecessor, with the user’s personal characteristics embedded. Further research tracing back more gram history with a larger data set is necessary. Physical mobility control and energy cost can be included to find the right patterns among movement direction, typing symbol composition, and keyboard layout. Subsequently, researchers may be able to discover a convenient, energy-saving, fixed or adaptive keyboard layout for users.

2) Prediction Using Time Gap: People with motor disability or Parkinson disease using a keyboard may press adjacent keys or stick keys. These errors can be seen from the time gap between each two consecutive keystrokes. For example, the time gap between the Windows keyboard messages caused by sticking keys can be much smaller than that of the user’s normal typing speed; the opposite case may also happen, when disabled people spend more time aiming at the target before making up their mind. From observation, interestingly, it is rare for these users to completely miss typing a symbol. According to these distinct behaviors, a neural network model using backpropagation, called PTG, is designed by adding an extra time gap variable to the input layer. Here, a small sample typed by a person with Parkinson disease, as shown in data set 3, is used to demonstrate the idea.
The typed sample is reconstructed for preprocessing as

@the quick [email protected]@@[email protected] @@[email protected]@ [email protected]@[email protected] @@[email protected] the [email protected]@azy [email protected]@@@

where the symbol @ represents an error or a NULL, compared with the correct sample that should be recognized by the PTG model. During preprocessing, the time gap value, which is one of the input parameters, is categorized into three levels and converted into a three-bit unary code: ‘<= 10 milliseconds’ (over-fast) => 001; between 10 and 1000 milliseconds => 010; ‘> 1000 milliseconds’ (over-slow) => 100. The user’s typing results are recorded by both Notepad and KeyCapture software. The PTG model is designed with a three-layer 30−7−28 structure, where the input is a 27-bit unary code over the symbol space {a . . . z, space} plus a three-bit unary code for the time gap, and the output is a 28-bit unary code over the symbol space {a . . . z, space, @}, where the symbol @ is added to represent an additional or missed symbol in the typing stream. The distribution of the correction rate over one hundred training runs is shown in Fig. 7; it has a mean value of 0.8480 and a deviation of 0.0501. The x-axis represents the correction rate based on the comparison between the target data set and
the PTG model prediction. The y-axis represents the absolute frequency over the one hundred training runs, that is, the number of times a particular outcome occurs. Fig. 7 also shows the range in which the PTG model’s correction rate lies: the results lie predominantly between 65% and 90%. For this test sample, the correction rate reaches nearly 90% about 27 times and falls below 65% only once. This test shows that the time gap can be used as an input element by a neural network model to correct wrongly typed symbols. Because no gram history is considered and the training data set is small, the relationship built between input and output is a pure right–wrong relationship, which calls for further research on n-gram language modeling with larger training and testing data sets.

Fig. 7. Absolute frequency of PTG model correction rate. [Plot title: Error correction training results of a sample; x-axis: Correction Rate; y-axis: Absolute Frequency.]

D. PNN Modeling

1) Assumption: The research carried out in this section is based on the case of a one-finger typist: every key press and movement relies entirely on a single finger. Skilled users’ control of their fingers may vary, and the distance their fingers move between two consecutive keystrokes could be more complex.

2) Key Distance Definition: According to the layout of a computer QWERTY keyboard, there is a physical distance between each two keys. Let $d_{i,j}$ be the distance between key i and key j, and define the unit of measure as the key distance. Then $d_{a,s} = 1$ means that the distance between key A and key S is one key distance, and $d_{a,f} = 3$ means there are three key distances between key A and key F. Users move their fingers toward the next key as soon as they finish the current key press. The distance between two keys affects a user’s typing performance.

3) Error Margin Distance Definition: Based on key distance, a variable $d_{s,f}$ is further defined as the distance between a user’s typed key (key s) and the target key (key f), and is called the error margin distance. The error margin distance is mainly caused by the user’s hitting adjacent key errors.

4) Key Distance Class Definition: Let us define a class $C_{key_i, j} = \{key_{ij} \mid key_i\}$, given $key_i, key_{ij} \in \{key_1, \ldots, key_n\}$, where i, j ≤ n, n is the number of keys on a computer QWERTY keyboard, and $key_{ij}$ is a key set around $key_i$ within j key distances. For instance, the one key distance set corresponding to key s is $C_{s,1} = \{s_1 \mid s\} \approx \{D, E, W, A, Z, X\}$.

Noisy data models are used not only to analyze language text but can also be applied to specific problems. Let us consider the helpline data (data set two) as a modeling scenario, in which a typist frequently makes hitting adjacent key errors. Before modeling, the mistakes are extracted from data set two. A sample of a hitting adjacent key error is shown as follows:

"Q" Status = (*) Key(*) Extra(*) KeyDistance(*) TimeGap(*)
"S" Status = (*) Key(*) Extra(*) KeyDistance(*) TimeGap(*)
"BACK" Status = (*) Key(*) Extra(*) KeyDistance(*) TimeGap(*)
"D" Status = (*) Key(*) Extra(*) KeyDistance(*) TimeGap(*)

which is a typical hitting adjacent key error that occurred within a user’s typing stream. The user’s intention is to type the letter d following the letter q, but the letter s is mistakenly pressed. Therefore the user has to go back and make a correction by pressing the backspace key shortly after the mistake is made. (Note: in virtual key codes, backspace is represented by BACK.) Both the key distance and the time gap are calculated and recorded in the log. The user investigation shows that users’ hitting adjacent key behavior is related to the positions of both the last key and the current key, if one ignores the stroke randomness that users’ symptoms may cause. It also shows that the speed at which a user moves from one key to another plays an important role in making such errors. For example, although typing faster than the user’s normal speed increases the occurrence of hitting adjacent key errors, hesitation, which leads to a much slower typing speed, does not always increase the rate of correct typing, as observed. Here, the idea is to use these essential parameters, namely key distance, time gap, and error margin distance, to discover the fundamental rules behind users’ typing mistakes. Let us start with the popular QWERTY keyboard layout and consider Figs. 8 and 9. In Fig. 8, key S is surrounded by the one key distance set {D, W, E, A, Z, X} and the two key distance set {Q, R, caps lock, F, |, C}.
Given certain inputs, if the neural network model is required to produce the symbol that a user intends to type, the model needs to deduce not only the set that the right symbol belongs to, but also the angle in which the user intends to move. This is shown in Fig. 9. All keys surrounding s are positioned at different angles. Let us assume the circle starts from the right-hand side of s and turns in an anticlockwise direction. Then the key D can be expressed by a three-dimensional vector, $key_d$ = {key = s, distance = 1, angle = 0}, where key = s is the coordinate origin, and distance = 1 and angle = 0 mean the key is one key distance away from key s at an angle of zero. The key A can be expressed as $key_a$ = {key = s, distance = 1, angle = π}, where distance = 1 and angle = π mean the key is one key distance away from key s at an angle of π.

Fig. 8. QWERTY keyboard layout sample. [Keys shown: Q, W, E, R; Caps Lock, A, S, D, F; |, Z, X, C.]

Fig. 9. Relationship (angle) between a key and its surrounding keys D, E, and A. [Labels shown: S at the center; D at 0, E at π/2, A at π.]

The key distance and time gap between the last two grams could determine the error margin between the typed key and the target key. To examine this hypothesis, a neural network topology is designed with distance, angle, and time gap vectors in the input layer and the error margin distance vector between the typed key and the target key in the output layer. This requires precise measurement of both input and output parameters. However, given the difficulty for a QWERTY keyboard and its associated operating system to capture users’ movement accurately, and the difficulty for a neural network to provide a precise output, this solution, as it stands, is not very practical. For example, the difference in angle between key S → key E and key S → key R is not significant. This high precision requirement raises the design difficulty of the neural network model. In practice, the input of the neural network model uses an (x, y) coordinate expression instead of distance and angle, where x is the x-axis key distance (i.e., horizontal distance) and y is the y-axis key distance (i.e., vertical distance). The x-axis key distance refers to a user’s horizontal move toward the typed key; the y-axis key distance refers to a user’s vertical move toward the typed key. When the error margin is calculated, the coordinate center lies at the current typed key. When the distance between the last typed key and the current typed key is calculated, the coordinate center lies at the last typed key. The sign of the key distance is determined as soon as the coordinate center is fixed. On a QWERTY keyboard there are at most six keys at one key distance around each key. The user investigation records suggest that most hitting adjacent key errors occur in an area where the keys are at most one key distance away from the target keys.
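The (x, y) key distance and the one key distance set can be illustrated with a small layout model. The following is a minimal sketch, not the preprocessing used in this research: it assumes only the three letter rows and a half-key shift between successive rows, and the names `ROWS`, `ROW_SHIFT`, `key_position`, `key_offset`, and `one_key_distance_set` are hypothetical. Physical keyboards may use different row offsets.

```python
# Assumed layout model: three letter rows, each shifted half a key to the
# right of the row above. This stagger is an illustrative assumption.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_SHIFT = 0.5


def key_position(key):
    """Return the (x, y) position of a letter key in key-distance units."""
    key = key.lower()
    for row_index, row in enumerate(ROWS):
        column = row.find(key)
        if column >= 0:
            # Rows further down get a smaller y, so that "up" is positive.
            return (column + ROW_SHIFT * row_index, float(-row_index))
    raise KeyError(f"{key!r} is not a letter key in this sketch")


def key_offset(center, other):
    """Signed (x, y) key distance of `other`, with `center` as the origin."""
    cx, cy = key_position(center)
    ox, oy = key_position(other)
    return (ox - cx, oy - cy)


def one_key_distance_set(center):
    """Letter keys that directly touch `center` in the staggered layout."""
    neighbours = set()
    for row in ROWS:
        for other in row:
            dx, dy = key_offset(center, other)
            same_row = dy == 0 and abs(dx) == 1
            adjacent_row = abs(dy) == 1 and abs(dx) == 0.5
            if same_row or adjacent_row:
                neighbours.add(other.upper())
    return neighbours


print(key_offset("s", "d"))  # (1.0, 0.0)
print(key_offset("d", "s"))  # (-1.0, 0.0): the sign flips with the origin
print(sorted(one_key_distance_set("s")))  # ['A', 'D', 'E', 'W', 'X', 'Z']
```

Under this assumed stagger, a key has two possible neighbours in its own row and two in each adjacent row, which agrees with the maximum of six one key distance keys stated above.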
Therefore, instead of computing a precise error margin $d_{t,f}$, the output of the neural network model can be designed as a six-class classifier. If one counts the classes in an anticlockwise direction in a conventional coordinate system, then, from Fig. 9, d belongs to class one, e belongs to class two, and so on. Thus the question becomes one of finding an appropriate neural network model to solve a classification problem associated with the input vectors: distance, angle, and time gap. It is well known that radial basis networks can require more neurons than standard feedforward backpropagation networks, but quite often they can be designed in a fraction of the time it takes to train standard feedforward networks. One type of radial basis network is the PNN, which can be used for classification problems. Because the PNN is time efficient and suited to classification, in this paper a 3−N−1 structure model called the distance, angle, and time gap (DATP) model is designed based on the PNN to predict where the target key could lie given a wrong key press. The DATP model consists of three layers: an input layer, a hidden layer, and an output layer. The hidden radial basis layer computes the distance between the input vector and the hidden weight vectors, and then produces a distance vector that indicates how close the input is to the correct letter. The third layer classifies the results of the hidden layer and produces the class. In this experiment, 33 hitting adjacent key errors are identified from the files of data set two and converted manually into the training data set format. Here an example is given to show the preprocessing procedure:

"C" Status = (*) Key(*) Extra(*) KeyDistance(*) TimeGap(78)
"J" Status = (*) Key(*) Extra(*) KeyDistance(*) TimeGap(108)
"BACK" Status = (*) Key(*) Extra(*) KeyDistance(*) TimeGap(78)
"H" Status = (*) Key(*) Extra(*) KeyDistance(*) TimeGap(923)
→ 3.5 1 108 4

The first four lines are extracted from data set two.
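The three-layer structure described above (a radial basis layer followed by a class decision) can be sketched in a few lines. This is a minimal illustration of a probabilistic neural network in general, not the DATP implementation: the function name `pnn_classify`, the kernel width `sigma`, the use of per-class averages, and the conversion of the time gap to seconds are all assumptions, and every training row except the first is a hypothetical value.

```python
import math


def pnn_classify(train_rows, query, sigma=1.0):
    """Pick the class with the largest average Gaussian activation.

    train_rows: list of (feature_tuple, class_label) pairs.
    query: feature tuple of the same length as the stored features.
    sigma: kernel width (an assumed smoothing parameter).
    """
    totals, counts = {}, {}
    for features, label in train_rows:
        # Radial basis layer: distance between the input and a stored pattern.
        squared_distance = sum((f - q) ** 2 for f, q in zip(features, query))
        activation = math.exp(-squared_distance / (2.0 * sigma ** 2))
        totals[label] = totals.get(label, 0.0) + activation
        counts[label] = counts.get(label, 0) + 1
    # Output layer: compare the per-class averages and keep the best class.
    return max(totals, key=lambda label: totals[label] / counts[label])


# Features: (horizontal distance, vertical distance, time gap in seconds).
# The first row follows the example row above (3.5, 1, 108 ms, class 4);
# the remaining rows are hypothetical values made up for this sketch.
train_rows = [
    ((3.5, 1.0, 0.108), 4),
    ((3.0, 1.0, 0.150), 4),
    ((1.0, 0.0, 0.090), 1),
    ((-2.0, 1.0, 0.300), 2),
]

print(pnn_classify(train_rows, (3.2, 1.0, 0.120)))  # 4
```

The time gap is rescaled to seconds here only so that it does not dominate the distance terms in the kernel; how the three inputs should be scaled relative to each other is left open.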
The line following the arrow is the data transformed manually from the lines above. It has four parameters: horizontal distance, vertical distance, time gap between two consecutive keystrokes, and class. It shows that the horizontal distance from C to J is 3.5 key distances (if the move were from J to C, the key distance would be −3.5); the vertical distance is one key distance; the time gap from C to J is 108 ms; and the class is 4, because the key H is on the left-hand side of key J. In the case of overlapping keys, a half key distance can be counted. For example:

"D" Status = (*) Key(68) Extra(*) KeyDistance(*) TimeGap(93)
"G" Status = (*) Key(71) Extra(*) KeyDistance(*) TimeGap(218)
"H" Status = (*) Key(72) Extra(*) KeyDistance(*) TimeGap(3)
→ 2.5 0 218 4

This is a typical key press with overlapped key G and key H. The time gap between the G press and the H press is 3 ms, which is much less than the user’s usual typing speed. This is confirmed by the user’s correction action that happened afterward in data set two. The horizontal key distance between key D and key G is two key distances; however, another 0.5 key distance is added in preprocessing to account for the overlap. The vertical distance between these two keys is zero, the time gap is 218 ms, and the output class is 4.

The experimental results show a correction rate of 50%, that is, five out of the ten testing samples. Because of the severity of the user’s typing disorder and the small size of the training data set, a random training and testing data set selection strategy is further adopted. A random function is applied to pick the training data set and the testing data set in proportions of 2/3 and 1/3, respectively. Two groups of trials are carried out, each including ten training and testing samples. The corresponding plots are shown in Fig. 10. The x-axis shows the training and testing samples that are picked randomly, and the y-axis shows the prediction rate of the DATP model. The dashed red line shows the prediction rate of each testing data set according to its training data set; the blue line is the average random prediction rate, which is named the basic rate. The first plot of Fig. 10 shows seven rounds whose prediction rates are above the basic rate, while the three remaining rounds are below it. The highest score (36%) occurs at the tenth round, while the lowest score (7%) occurs at the third round. The second plot shows six rounds out of eight whose prediction rates are above the basic rate, while the rest are below it. The highest score (40%) occurs at the third round, while the lowest score (0%) occurs at the eighth round. Both plots show that 70% of all tests score above the basic rate. They also show a very unstable trend in the user’s hitting adjacent key errors. This suggests that a small training data set may not give a high prediction rate, because the data set converges poorly. In that case, several rounds of training with a random data set selection strategy may be required.

Fig. 10. Hitting adjacent key prediction rates based on PNN network. [Two plots, each titled: PNN prediction of hitting adjacent key; x-axis: Training and Testing with Random Sample (1 to 10); y-axis: Hitting Rate; legend: Basic Rate, Pnn Rate.]

E. Noisy Language Modeling Summary

In the above research, a series of distinct neural network models are developed as statistical learning tools to process noisy language streams (here, typing streams), to identify the patterns that a particular person exhibits, and in practice to provide users with functions such as text prediction and text correction. Their functionalities, performances, and the data sets applied are listed in Table I.

TABLE I
NEURAL NETWORKS FOR NOISY LANGUAGE MODELING AND PERFORMANCES

Model                                         Dataset    Noisy  Performance
FTDNN n-gram prediction                       Dataset 1  No     First HR = 37%, FT HR = 58% ^1
FTDNN n-gram prediction with noise            Dataset 1  Yes    Noise Rate [0, 0.3] → [37%, 27%] ^2
FTDNN n-gram prediction with typed data       Dataset 2  Yes    First HR = 38%, FT HR = 53% ^3
TGNN time gap modeling                        Dataset 2  No     A = 384.44 ms, Q = 342.50 ms ^4
Prediction using time gap (PTG)               Dataset 3  Yes    Correction rates in [65%, 90%]
Probabilistic neural network modeling (PNN)   Dataset 2  Yes    70% >= basic rate ^5

^1 The best performance.
^2 First hitting rate (2-gram with fifty neurons).
^3 The best performance: 38% under 3-gram; 53% under 1-gram.
^4 The cost of typing an alphabet order sequence is 384.44 ms, while 342.50 ms in a QWERTY order.
^5 70% of all tests score above basic rate.

First, an innovative FTDNN language model is designed and run on noise-free, noisy, and typing stream data sets. It is developed with extendible numbers of hidden layer neurons and extendible numbers of time delays. Based on a user’s typing history, a 38% first HR and a 53% FT HR are obtained. Then, the influence of time gap on a user’s typing performance is studied and a unique time gap model is developed. Inspired by this result, a PTG model is also developed.
The experimental results show that the correction rates lie predominantly between 65% and 90% with the current testing samples. Furthermore, an original model based on PNN is developed to simulate a specific user typing behavior, hitting adjacent key errors, based on key distances. The results show that about 70% of all tests score above the basic correction rate.

III. CONCLUSION

In this paper, a novel concept, namely noisy language modeling, was presented, and an intensive neural network language modeling process was applied to a specific application: analyzing disabled users’ typing streams. During the research, an FTDNN language model, a time gap model, a prediction model based on time gap, and a PNN model were developed. Experimental results showed that the neural network language modeling approach proposed by the research produced a higher performance for each individual model. Overall, the modeling process demonstrated that a neural network was a suitable and robust language modeling tool to analyze the noisy language stream.

In essence, this paper provided pilot modeling approaches by applying distinct neural network models to a new field, noisy language modeling. This paper was a further development of neural network application and information theory. It also developed neural networks further by integrating different technologies, i.e., neural networks and n-grams, based on in-depth modeling. Meanwhile, it explored implicit variables embedded in the noisy language, such as time gap and key distance, for modeling use. The research paved the way for practical application development in the areas of informational analysis, text prediction, and error correction by providing a theoretical basis of neural network approaches for noisy language modeling.

Further work in noisy language modeling is twofold. In information theory, Shannon entropy, the minimal message length necessary to communicate information, was calculated based on clean English text and estimated as between 1.0 and 1.5 bits/character [29], [28].
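As a rough illustration of how an entropy measurement on a typing stream could be started, the sketch below computes a zero-order empirical entropy, $H = -\sum_x p(x)\log_2 p(x)$, from single-character frequencies. It is only a sketch under stated assumptions: the function name `empirical_entropy` and the sample strings are hypothetical, and a zero-order estimate ignores context, so it is much cruder than the context-based estimates cited above and typically gives larger values for English text.

```python
import math
from collections import Counter


def empirical_entropy(text):
    """Zero-order Shannon entropy of `text`, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum -p * log2(p) over the relative frequency p of each distinct symbol.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(empirical_entropy("abab"))  # 1.0: two equally likely symbols
print(empirical_entropy("abcd"))  # 2.0: four equally likely symbols
```

Applying the same function to a clean text and to a logged typing stream of the same text would give one simple, context-free way to compare the two.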
There is, however, reason from this paper to believe that the values would be much higher if noise such as human errors were counted. Therefore, a calculation of natural entropy using high performance computing, through extensive noisy data collection and sharing, is required. Given the experimental results achieved in this paper, a comparison with other effective algorithms, such as the combination of neural network and Markov algorithms [4], [23], and an expansion of the FTDNN modeling using a distributed representation method [18], [25], with a larger data set and long-range (or multistep) prediction [27], [36], [40], should be explored.

ACKNOWLEDGMENT

The authors would like to thank Disability Essex [44] and the Technology Strategy Board [46] for their funding, R. Boyd and P. Collings for helpful advice and discussion, and the anonymous reviewers for their helpful comments.


REFERENCES

[1] W. A. Beeching, Century of the Typewriter. New York, NY, USA: St. Martin’s, 1974.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, Mar. 2003.
[3] B. Charles. (2009, Jan. 18). PPMZ-High Compression Markov Predictive Coder [Online]. Available: http://www.cbloom.com/src/ppmz.html
[4] G. C. Bohling and M. K. Dubois, “An integrated application of neural network and Markov chain techniques to prediction of lithofacies from well logs,” Kansas Geological Survey, Lawrence, KS, USA, Tech. Rep. 2003-50, 2003.
[5] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA, USA: MIT Press, 1999.
[6] J. G. Cleary and I. H. Witten, “Data compression using adaptive coding and partial string matching,” IEEE Trans. Commun., vol. 32, no. 4, pp. 396–402, Apr. 1984.
[7] D. Hankerson, G. A. Harris, and P. D. Johnson, Jr., Introduction to Information Theory and Data Compression, 2nd ed. Boca Raton, FL, USA: CRC Press, 2003.
[8] D. J. Ward, A. F. Blackwell, and D. J. C. Mackay, “Dasher: A data entry interface using continuous gestures and language models,” in Proc. 13th Annu. ACM Symp. User Inter. Softw. Technol., 2000, pp. 129–137.
[9] S. P. Day and M. R. Davenport, “Continuous-time temporal backpropagation with adaptive time delays,” IEEE Trans. Neural Netw., vol. 4, no. 2, pp. 348–354, Mar. 1993.
[10] P. M. Fitts, “The information capacity of the human motor system in controlling the amplitude of movement,” J. Experim. Psychol., vol. 47, no. 6, pp. 381–391, Jun. 1954.
[11] J. Grim, P. Somol, and P. Pudil, “Probabilistic neural network playing and learning Tic-Tac-Toe,” J. Pattern Recognit. Lett., Artif. Neural Netw. Pattern Recognit. Archive, vol. 26, no. 12, pp. 1866–1873, Sep. 2005.
[12] D. Jurafsky and J. H. Martin, Speech and Language Processing: International Version: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2008.
[13] K. Sayood, Introduction to Data Compression, 3rd ed. San Mateo, CA, USA: Morgan Kaufmann, 2005.
[14] H. Jiawei and K. Micheline, Data Mining: Concepts and Techniques. San Diego, CA, USA: Academic, 2001.
[15] H. Evelyn, Thomas Hardy: A Critical Biography. London, U.K.: Hogarth, 1954.
[16] P. Haffner and A. Waibel, “Multi-state time delay neural networks for continuous speech recognition,” in Advances in Neural Information Processing Systems, vol. 4. Cambridge, MA, USA: MIT Press, 1992, pp. 135–142.
[17] H. Simon, Neural Networks: A Comprehensive Foundation, 2nd ed. Blowing Rock, NC, USA: Tom Robbins, 1999.
[18] G. E. Hinton, “Learning distributed representations of concepts,” in Proc. 8th Annu. Conf. Cognit. Sci. Soc., 1986, pp. 1–12.
[19] I. Anagnostopoulos, C. Anagnostopoulos, D. Vergados, V. Loumos, and E. Kayafas, “A probabilistic neural network for human face identification based on fuzzy logic chromatic rules,” in Proc. 11th Medit. Conf. Control Autom., Jun. 2003, pp. 1–6.
[20] J. Li, K. Ouazzane, H. Kazemian, Y. Jing, and R. Boyd, “A neural network based solution for automatic typing errors correction,” Neural Comput. Appl., vol. 20, no. 6, pp. 889–896, 2011.
[21] J. Li, K. Ouazzane, H. Kazemian, Y. Jing, and R. Boyd, “Focused time-delay neural network modeling towards typing stream prediction,” in Proc. IADIS Int. Conf. Intell. Syst. Agents, 2009, pp. 189–194.
[22] T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur, “Recurrent neural network based language model,” in Proc. 11th Annu. Conf. Int. Speech Commun. Assoc., 2010, pp. 1045–1048.
[23] U. Ohler, G. Stemmer, and H. Niemann, “A hybrid Markov chain-neural network system for the exact prediction of eukaryotic transcription start sites,” Genome Res. World Sci. Genome Res., vol. 10, no. 10, pp. 539–542, Jan. 2000.
[24] K. Ouazzane, J. Li, and M. Brouer, “A hybrid framework towards the solution for people with disability effectively using computer keyboard,” in Proc. IADIS Int. Conf. Intell. Syst. Agents, 2008, pp. 209–212.
[25] A. Paccanaro and G. E. Hinton, “Extracting distributed representations of concepts and relations from positive and negative propositions,” in Proc. Int. Joint Conf. Neural Netw., 2000, pp. 254–264.
[26] H. Schwenk and J. Gauvain, “Training neural network language models on very large corpora,” in Proc. Conf. Human Lang. Technol. Empirical Methods Natural Lang. Process., 2005, pp. 201–208.
[27] B. Schenker and M. Agarwal, “Long-range prediction for poorly-known systems,” Int. J. Control, vol. 62, no. 1, pp. 227–238, 1995.
[28] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 379–423, Jul. 1948.
[29] C. E. Shannon, “Prediction and entropy of printed English,” Bell Syst. Tech. J., vol. 30, pp. 50–64, Jan. 1951.
[30] S. Trewin, “Automating accessibility: The dynamic keyboard,” in Proc. 6th Int. ACM SIGACCESS Conf. Comput. Access., 2003, pp. 71–78.
[31] S. Trewin and H. Pain, “A model of keyboard configuration requirements,” in Proc. Int. ACM Conf. Assist. Technol., 1998, pp. 173–181.
[32] D. Shi, H. J. Zhang, and L. M. Yang, “Time-delay neural network for the prediction of carbonation tower’s temperature,” IEEE Trans. Instrum. Meas., vol. 52, no. 4, pp. 1125–1128, Aug. 2003.
[33] S. William and M. Scott. (2009, Jan. 18). KeyCapture [Online]. Available: http://dynamicnetservices.com/~will/academic/textinput/keycapture/
[34] R. W. Soukoreff and I. S. MacKenzie, “Input-based language modeling in the design of high performance text input techniques,” in Proc. Graph. Interf., 2003, pp. 89–96.
[35] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1998.
[36] H. T. Su, T. J. McAvoy, and P. J. Werbos, “Long-term predictions of chemical processes using recurrent neural networks: A parallel training approach,” Ind. Appl. Chem. Eng. Res., vol. 31, no. 5, pp. 1338–1352, 1992.
[37] T. Ganchev, A. Tsopanoglou, N. Fakotakis, and G. Kokkinakis, “Probabilistic neural networks combined with GMMs for speaker recognition over telephone channels,” in Proc. 14th Int. Conf. Digit. Signal Process., vol. 2. Jul. 2002, pp. 1081–1084.
[38] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Trans. Acoust., Speech, Signal Proces., vol. 37, no. 3, pp. 328–339, Mar. 1989.
[39] P. D. Wasserman, Advanced Methods in Neural Computing. New York, NY, USA: Van Nostrand, 1993.
[40] J. X. Xie, C.-T. Cheng, K.-W. Chau, and Y.-Z. Pei, “A hybrid adaptive time-delay neural network model for multi-step-ahead prediction of sunspot activity,” Int. J. Environ. Pollut., vol. 28, nos. 3–4, pp. 364–381, 2006.
[41] W. Xu and A. I. Rudnicky, “Can artificial neural networks learn language models,” in Proc. ICSLP, 2000, no. M1-13, pp. 1–4.
[42] A. Yazdizadeh and K. Khorasani, “Adaptive time delay neural network structures for nonlinear system identification,” Neurocomputing, vol. 47, pp. 207–240, Aug. 2002.
[43] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, Feb. 2003.
[44] (2009, Jan. 18). Disability Essex [Online]. Available: http://www.disabilityessex.org
[45] (2008, Jan. 25). Inference Group of Cambridge, Dasher [Online]. Available: http://www.inference.phy.cam.ac.uk/dasher/
[46] (2009, Jan. 18). Knowledge Transfer Partnership [Online]. Available: http://www.ktponline.org.uk/
[47] Sensory Software International Ltd. (2008, Feb. 15). ProtoType, Worcestershire, U.K. [Online]. Available: http://www.sensorysoftware.com/prototype.html
[48] (2009, Feb. 05). Virtual Key Codes [Online]. Available: http://api.farmanager.com/en/winapi/virtualkeycodes.html
[49] (2009, Nov. 13). What is Statistical Language Modeling (SLM). School Inf., University of Edinburgh, Edinburgh, U.K. [Online]. Available: http://homepages.inf.ed.ac.uk/lzhang10/slm.html

Jun Li received the B.Sc., M.Sc., and Ph.D. degrees. He is a Research Associate with the University of Oxford, Oxford, U.K., working on mathematical model development and data management and analysis in radiation oncology and biology. He was a Research Associate of mathematical modeling and applied machine learning with the University of Cambridge, Cambridge, U.K., working on a large scale computational model development and related data analysis from 2009 to 2012. He is a member of the Intelligent Systems Research Centre, London Metropolitan University, London, U.K., and a Visiting Lecturer with the Faculty of Computing, London Metropolitan University, working on intelligent algorithms and computer security.

Karim Ouazzane is a Professor of computing and knowledge exchange and the Deputy Director of the Intelligent Systems Research Centre, London Metropolitan University, London, U.K. He has been contributing to research pertinent to AI and expert systems and produced internationally recognized approaches which have been disseminated through journal papers and conferences. He works closely with industry and developed two proofs of concepts which have been patented.

Hassan B. Kazemian (SM’88) received the B.Sc. degree in engineering from Oxford Brookes University, Oxford, U.K., the M.Sc. degree in control systems engineering from the University of East London, London, U.K., and the Ph.D. degree in learning fuzzy controllers from the Queen Mary University of London, London, in 1985, 1987, and 1998, respectively. He is currently a Full Professor with London Metropolitan University, London, U.K. He was with Ravensbourne College University Sector, U.K., as a Senior Lecturer, for eight years. His previous lecturing experiences include the University of East London, London, University of Northampton, Northampton, U.K., and Newham College, U.K. His current research interests include artificial intelligence applications to QWERTY keyboard, networks, and bioinformatics. Prof. Kazemian is a fellow of the Institution of Engineering and Technology FIET (formerly IEE) U.K. and Chartered Engineer, U.K.

Muhammad Sajid Afzal is currently pursuing the Ph.D. degree with London Metropolitan University, London, U.K. He is currently a Software Developer in a software house. He has contributed to research in the general area of intelligent systems. He has published and presented two papers at the ICEIS 11 Conference in Beijing, China.
