AN EXPERT SYSTEM FOR ENVIRONMENTAL DATA MANAGEMENT

PETR BERKA and PETR JIRKU
Dept. of Information and Knowledge Engineering, Faculty of Informatics and Statistics, Prague University of Economics, W. Churchill Sq. 4, CZ-13067 Prague 3, Czech Republic

Abstract. In this paper we show the possibility of using expert system tools for environmental data management. We describe the domain-independent expert system shell SAK and Knowledge EXplorer, a system which learns rules from data. We show the functionality of Knowledge EXplorer on an example of water quality evaluation.

1. Introduction

Any successful decision-making is strongly dependent upon various capabilities which include the effective acquisition, storage, distribution and sophisticated use of the knowledge of human experts in the field. In the context of computer-aided systems for monitoring and information processing, these capabilities can be achieved with different technologies, for example:

• systems for information retrieval;
• expert systems;
• deductive data bases;
• integrated knowledge-based systems.

Systems for information retrieval are usually implemented via standard data base systems. Such systems usually work with thousands of records which have to be regularly updated and searched, so effective methods for storing and accessing large amounts of data are of great importance.

Expert systems employ human knowledge to solve problems that would otherwise require human intelligence. Expert systems consist of two main parts: the knowledge base and the inference mechanism. The knowledge base contains expert knowledge for a given domain (typically in the form of rules), and the inference mechanism is a domain-independent program which works with the knowledge base in order to reach final conclusions (recommendations, diagnoses). During a consultation the system asks questions relevant to the investigated conclusion.

Deductive data bases form an application domain for techniques that are based on revisable reasoning. They extend standard relational data bases in a natural way, i.e. the scheme of knowledge representation is extended from simple facts to rules which are described by more complex formulas. Querying a data base then becomes a kind of deductive process instead of simply accepting or refuting the facts involved in a data base.

Environmental Monitoring and Assessment 34: 185-195, 1995. © 1995 Kluwer Academic Publishers. Printed in the Netherlands.


This allows more sophisticated procedures to be used to derive knowledge items which are not explicitly included in a data base but can be deduced from it. Integrated knowledge-based systems can combine all three methodologies mentioned above into more powerful systems.

In this paper we will focus our attention on the methodology of expert systems. There is a wide range of both software and methodologies for building expert systems, ranging from simple expert system shells (empty expert systems, which can be used for different applications) for about 100 US dollars (e.g. VP-Expert) to powerful development environments at a price of about 10 000 US dollars (e.g. the Knowledge Engineering Environment (KEE)). At the Prague University of Economics we have developed our own methodology and software products which can be used for building problem-oriented knowledge bases.

2. The Expert System SAK

SAK (System for Automated Consultations) is a rule-based expert system shell which allows diagnostic expert systems in various areas to be created in a simple way. This shell has already been used for applications in different problem domains, such as the expert system OPTIMALI for selecting an appropriate mathematical decision method, the expert system TEAM recommending teamwork improvements, or the expert system ZPDOPRAV for evaluating traffic impacts on environmental pollution. The knowledge used in this system is represented in the form of:

• propositions - basic items of knowledge used in SAK, e.g. the patient is a woman or the concentration of nitrates is low;
• IF-THEN rules in the form C: Ant ⇒ Suc (w), where C (context) is a single proposition, Ant (antecedent) is a combination of propositions, Suc (succedent) is a proposition, and w (weight) is used to express the degree of uncertainty of the rule; the condition of the rule (the antecedent) is evaluated only if the context has positive weight;
• numerical variables (e.g. age, concentration, electric current);
• external procedures (e.g. electric current can be computed from potential and resistance).

A knowledge base consists of hundreds of propositions and rules. According to their location in the rules, the propositions can be divided into three groups:
1. Questions. Propositions which occur only in the antecedents (these are the questions posed to the user during consultation).
2. Goals. Propositions which occur only in the succedents (these are the final conclusions, recommendations or diagnoses of the system).


3. Intermediate propositions. Propositions which occur both in antecedents and succedents of (different) rules (these propositions allow inference chains to be constructed, where a sequence of rules leads from a question to a goal).

The inference mechanism used in SAK selects the rules for evaluation according to their location in the knowledge base. The system dynamically selects the next rule for evaluation according to the current state of inference (that is, the evaluated rules and the user's answers to the questions). The system handles uncertainty both in the available expert knowledge (expressed as weights of rules) and in the user's answers during consultation. The uncertainty is expressed using numbers from a scale (-E, E), where -E means 'certainly no', E means 'certainly yes' and 0 means 'unknown'. SAK offers two ways to process the uncertainty:
1. Standard (pseudobayesian), based on the classical PROSPECTOR (Duda and Gaschnig, 1979) and MYCIN (Shortliffe, 1976) approach.
2. Many-valued (fuzzy) logic, based on application of the completeness theorem for Lukasiewicz many-valued logic, where the knowledge base is built upon an uncertain axiomatic theory (Ivánek et al., 1989).

These two methods differ in how they evaluate:

• the weight of the negation of a proposition;
• the weight of the combination of several propositions;
• the contribution of a rule to the weight of the succedent;
• the combination of several rules with the same succedent.

Using these four functions, the system evaluates the weights of all goals according to the answers obtained from the user. During consultation, the user can also ask the questions How, Why and What if. These exploration facilities can give the user a deeper insight into the reasoning process and so give him or her arguments for accepting the system's recommendation. It is assumed that the rules for SAK are obtained either by iterative consultation between knowledge engineers and experts, or derived automatically from data. A method of mechanizing this process is described in the following section.
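To make the knowledge representation concrete, the following sketch models SAK-style rules in Python. The class and field names are illustrative (they are not SAK's actual data structures); only the ingredients named in the text - context, antecedent, succedent and a weight on the scale (-E, E) - come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

# Weights lie on a scale (-E, E): -E = 'certainly no',
# 0 = 'unknown', E = 'certainly yes'.
E = 1.0

@dataclass
class Rule:
    context: Optional[str]    # single proposition; antecedent is
                              # evaluated only if context weight > 0
    antecedent: List[str]     # combination of propositions
    succedent: str            # proposition concluded by the rule
    weight: float             # degree of uncertainty, in (-E, E)

# A hypothetical rule in the spirit of the water-quality domain:
r = Rule(context=None,
         antecedent=["the concentration of nitrates is low"],
         succedent="water quality is class I",
         weight=0.8)
```

During a consultation, propositions appearing only in antecedents would be asked as questions, while those appearing only in succedents would accumulate weights as goals.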

3. The Knowledge EXplorer

The system KEX (Knowledge EXplorer) can be used for:
• systematic basic analysis of multidimensional categorical data (a data matrix which contains such characteristics as sex (male, female), hair-color (red, blond, black, brown), etc.) with the aim of finding relations of implication and equivalence between pairs of combinations of categories;


• acquiring knowledge bases of rules from multidimensional data without human experts.

The first type of task, also called combinational data analysis, was inspired by the GUHA system, which was designed to find all relevant relations in the data (Hájek and Havránek, 1978). Knowledge EXplorer performs several tasks, which differ according to the desired type of relations:

• specific evaluation;
• complete exploration;
• analysis of conclusions;
• analysis of causes.

The basic idea of these tasks is to find all the 'interesting' relations in the given data. Those of interest are specified by the user by selecting the task and its input parameters. All these tasks can be viewed as exploratory data analysis methods. The resulting relations (implications between pairs of combinations of attribute-value tuples) can be interpreted as knowledge about dependencies between attributes in the given data. For every such implication in the form Ant ⇒ Suc, where Ant and Suc are two combinations (a combination is, e.g., sex = female and hair-color = brown), we can create a fourfold contingency table in the form (Ant = antecedent; Suc = succedent) shown in Table I.

TABLE I
Fourfold contingency table.

              Suc    non(Suc)
Ant            a        b
non(Ant)       c        d

where a is the number of objects which fulfil both Ant and Suc, b is the number of objects which fulfil Ant but do not fulfil Suc, c is the number of objects which do not fulfil Ant but fulfil Suc, and d is the number of objects which fulfil neither Ant nor Suc. We can define the validity of the implication Ant ⇒ Suc as the relative frequency of Ant occurring together with Suc (the conditional probability P(Suc|Ant)):

P(Suc|Ant) = a / (a + b).


We can also define the coverage of the implication Ant ⇒ Suc as the relative frequency of Suc occurring together with Ant (the conditional probability P(Ant|Suc)):

P(Ant|Suc) = a / (a + c).

So the validity of Ant ⇒ Suc equals the coverage of Suc ⇒ Ant and vice versa. Validity and coverage are in the range (0, 1) ((0%, 100%)). If the validity of Ant ⇒ Suc equals p, then the validity of Ant ⇒ ¬Suc equals 1 - p (100 - p %). The validity of an implication expresses how strongly the left-hand side combination is related to the right-hand side combination (if all objects which fulfil the left-hand side also fulfil the right-hand side, the validity equals 1). The coverage of an implication expresses how strongly the right-hand side combination is related to the left-hand side combination (if all objects which fulfil the right-hand side also fulfil the left-hand side, the coverage equals 1). So both measures give us a quantitative description of the implication in question.

The knowledge acquisition component performs symbolic empirical multiple-concept learning from examples (cases), where the induced concept description has the form of a set of weighted decision rules Ant ⇒ G (weight), where Ant = j1c1 ... jkck is a combination (a conjunction of attribute-value tuples, also called categories) of length k, and G is the fixed goal combination. Our idea of knowledge acquisition is to find the minimal set of rules which describes the given data and which can be directly used for consultation in a PROSPECTOR-like expert system (Duda and Gaschnig, 1979). The knowledge acquisition component consists not only of a module for learning but also of a testing module and a consultation module. KEX can deal with noisy data, unknown values, redundancy and contradictions. The algorithm does not perform incremental learning steps. As in the analysis of causes, the knowledge base is generated for a user-given right-hand side combination G (e.g. a list of diagnoses). KEX works in an iterative way, each iteration testing and expanding an implication Ant ⇒ G.
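The fourfold table, validity and coverage defined above can be sketched in a few lines of Python. This is an illustrative implementation of the definitions, not KEX's own code; the case records are invented examples.

```python
# Build the fourfold contingency table for Ant => Suc over a list of
# cases, then compute validity P(Suc|Ant) = a/(a+b) and
# coverage P(Ant|Suc) = a/(a+c).
def fourfold(cases, ant, suc):
    a = sum(1 for x in cases if ant(x) and suc(x))
    b = sum(1 for x in cases if ant(x) and not suc(x))
    c = sum(1 for x in cases if not ant(x) and suc(x))
    d = sum(1 for x in cases if not ant(x) and not suc(x))
    return a, b, c, d

def validity(a, b, c, d):
    return a / (a + b)

def coverage(a, b, c, d):
    return a / (a + c)

# Hypothetical cases: dicts of attribute -> category.
cases = [{"sex": "female", "hair": "brown"},
         {"sex": "female", "hair": "red"},
         {"sex": "male", "hair": "brown"}]
t = fourfold(cases,
             ant=lambda x: x["sex"] == "female",
             suc=lambda x: x["hair"] == "brown")
# t == (1, 1, 1, 0); validity = 0.5, coverage = 0.5
```

Note that swapping the `ant` and `suc` predicates exchanges b and c, which is exactly the validity/coverage duality stated in the text.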
This process starts with an 'empty rule' whose weight is the relative frequency of G in the data, and stops after testing all the implications which were created according to user-defined criteria. The implications are evaluated in order of decreasing frequency of Ant, so that the most reliable implications are tested first. During testing, the validity (the conditional probability P(G|Ant)) of an implication is computed. If this validity differs significantly from the composed weight (the value obtained when composing the weights of all subrules of the implication Ant ⇒ G), then this implication is added to the knowledge base. The weight of this new rule is computed from the validity and the composed weight using an inverse composing function. For composing weights we use PROSPECTOR's combining function

x ⊕ y = (x · y) / (x · y + (1 - x) · (1 - y)).

During expanding, new implications are created by adding single categories to Ant. These categories are added in descending order of their frequencies. New implications are stored (according to the frequencies of Ant) in an ordered list of implications. Thus KEX generates every implication only once, and for any implication in question all its subimplications have already been tested. The system can learn rules for a single concept described as a goal combination (a conjunction of categories) or for multiple disjoint concepts, which correspond to different categories of a given attribute.
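The combining function above, and the inverse composing step mentioned in the text, can be sketched as follows. The formula for `compose` is PROSPECTOR's as given in the paper (with weights rescaled to (0, 1), 0.5 meaning 'unknown'); `inverse_compose` is a hypothetical helper derived algebraically from it, mirroring how KEX computes the weight of a new rule from the observed validity and the composed weight of its subrules.

```python
def compose(x, y):
    """PROSPECTOR's combining function x (+) y."""
    return x * y / (x * y + (1 - x) * (1 - y))

def inverse_compose(composed, target):
    """Weight w such that compose(composed, w) == target,
    obtained by solving the combining formula for w."""
    return target * (1 - composed) / (
        target * (1 - composed) + composed * (1 - target))

# Two rules supporting the same succedent reinforce each other:
z = compose(0.8, 0.7)
# Weight a new rule needs so that, composed with an existing weight
# of 0.6, the result matches an observed validity of 0.9:
w = inverse_compose(0.6, 0.9)
```

A useful sanity check is that 0.5 ('unknown') is the neutral element: composing any weight with 0.5 leaves it unchanged.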

4. Problems of Building Knowledge Bases

In this section we describe the main problems that must be solved when applying the methodologies described above. In our opinion, these problems fall into the following three categories:

4.1. PROBLEM OF KNOWLEDGE ACQUISITION

This concerns the acquisition and storing of knowledge and data. Knowledge acquisition is the main difficulty in building expert system applications. When done classically, as repeated knowledge elicitation sessions between the knowledge engineer and the domain expert, it is a very time-consuming process. The question of how to automate this process is now being studied in the AI community. Machine learning techniques (such as in the KEX system) can be used for this purpose.

4.2. PROBLEM OF KNOWLEDGE TRANSFER

This is the task of knowledge distribution, especially the relation between the conceptual schema and the details of implementation.

4.3. PROBLEM OF ACCEPTANCE OF A SYSTEM BY THE END-USER

This is not only the question of the appropriate use of knowledge but also that of integrating the developed knowledge-based system into the user's computational environment, and of the user-friendliness of the interface. We will discuss these problems in an appropriate case study based on selected environmental data.


5. Water Quality: a Case Study

When using environmental data, we have to solve an additional problem. As shown earlier, expert systems methodology is based more on symbolic than on numerical computing, so the original data, which are numerical in nature (e.g. results of measurements), must be categorized. This can be done either manually, based on background knowledge about the domain, or (in some systems) automatically, using statistical methods.

For the case study, we selected the problem of water quality evaluation. The analysis was performed with the aim of finding as many knowledge items as possible about the relations between the descriptors of inorganic and organic components in selected rivers. We used the KEX system to find such relations automatically. The input data consist of the following 12 attributes (descriptors) collected for 31 measurement points: oxygen; BSK 5 (biological consumption of oxygen measured within 5 days); CHSK Mn (chemical consumption of oxygen measured using Mn); CHSK Cr (chemical consumption of oxygen measured using Cr); soluble substances; insoluble substances; ammoniacal N; nitrates; total phosphorus; chlorides; sulphates; index of saprobity (occurrence of algae). The data were obtained from the Report on the Environment in the Czech Republic for the year 1991, where all attributes are already categorized into five categories: class I, class II, class III, class IV and class V. KEX uses numbers to code the attributes and single characters to code the (categorized) values of attributes, so that 12a denotes 'attribute 12 has value a', which means the index of saprobity belongs to class I.

If we are interested in particular relations (in our case study, dependencies of the occurrence of algae on inorganic components), we have to run exploration analysis tasks. We ran the analysis of causes task with attribute 12 as the goal of the analysis.
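The categorization step discussed above - mapping raw measurements to the five quality classes - can be sketched as a simple threshold lookup. The numeric boundaries below are invented for illustration (the yearbook data used here were already categorized); only the five-class scheme and its letter coding come from the paper.

```python
import bisect

# Assumed class boundaries for a hypothetical nitrate
# concentration (mg/l); values are illustrative only.
NITRATE_BOUNDS = [5.0, 15.0, 30.0, 50.0]
CLASSES = ["a", "b", "c", "d", "e"]   # classes I-V, coded as in KEX

def categorize(value, bounds=NITRATE_BOUNDS):
    """Return the class letter of the interval containing value."""
    return CLASSES[bisect.bisect_right(bounds, value)]

categorize(12.0)   # class "b" (class II) under the assumed boundaries
```

Manual categorization based on domain knowledge amounts to choosing these boundaries by hand; automatic methods would derive them statistically from the distribution of the measurements.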
The relations obtained by this exploration task can be viewed as knowledge about the causes of the different indexes of saprobity, to be used (visually interpreted) by a human. The listing in Table II shows the first 10 of 503 relations found for the following parameters: the goal attribute is the index of saprobity (attribute no. 12), the maximal length of the left-hand side of an implication is 5, the minimal frequency of the left-hand side is 1, and the minimal validity is 0.80.

In the listing in Table II, each row describes one implication Ant ⇒ Suc. So, for example, in the row

4  11  21  10  0.9091  0.4762  3c4d ⇒ 12c

• 4 is the number of the implication,
• 11 is the number of objects which fulfil the left-hand side of the implication,
• 21 is the number of objects which fulfil the right-hand side of the implication,
• 10 is the number of objects which fulfil both the left-hand side and the right-hand side of the implication,


TABLE II
Generated implications.

            Frequencies
No.   Left   Right   Both   Validity   Coverage   Implication
1     16     21      13     0.8125     0.6190     4d ⇒ 12c
2     13     21      11     0.8462     0.5238     3c ⇒ 12c
3     11     21      9      0.8182     0.4286     9c ⇒ 12c
4     11     21      10     0.9091     0.4762     3c4d ⇒ 12c
5     10     21      8      0.8000     0.3810     10b ⇒ 12c
6     10     21      8      0.8000     0.3810     4d5b ⇒ 12c
7     9      21      9      1.0000     0.4286     9c8c ⇒ 12c
8     8      21      7      0.8750     0.3333     11b4d ⇒ 12c
9     8      21      7      0.8750     0.3333     7d4d ⇒ 12c
10    8      21      7      0.8750     0.3333     10b4d ⇒ 12c

• 0.9091 is the validity of the implication,
• 0.4762 is the coverage of the implication,
• 3c4d ⇒ 12c is the implication: IF CHSK Mn == class III AND CHSK Cr == class IV THEN index of saprobity == class III.

In the case of exploratory analysis, it makes sense to interpret each relation separately. The implications found can thus serve as initial knowledge for further discussions with the domain expert.

We can also directly create the knowledge base from the input data. When creating the knowledge base with the same input parameters as in the exploration task, we obtained 20 rules from the 503 tested implications, as shown in Table III. The first four rules in Table III are the 'default rules'. If no information is available about a new case, only these rules are activated; for our knowledge base this results in assigning the new case to class 12c. The performance of the knowledge base (given as the number of successful classifications of the training objects) is 97% (Table IV). For every class (a row in Table IV), the number of classifications carried out by the system and the number of correct classifications are given. Some examples in the data set may remain unclassified: either all resulting weights were in the range (0.45, 0.55) (the row 'not decided' in Table IV), or there was no applicable rule in the knowledge base (the row 'not predicted' in Table IV). If the knowledge base contains a default rule, a prediction is always carried out. The resulting performance of the system is given in the total row.
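The tallying behind Table IV can be sketched as follows. This is illustrative code, not KEX's testing module; the `predict` and `goal_of` callables are hypothetical stand-ins for the consultation component and the known class of each training case.

```python
# Count true/false classifications plus the 'not decided' and
# 'not predicted' outcomes described in the text.
def evaluate(cases, predict, goal_of):
    counts = {"true": 0, "false": 0, "not decided": 0, "not predicted": 0}
    for case in cases:
        pred = predict(case)        # a class, "undecided", or None
        if pred is None:
            counts["not predicted"] += 1
        elif pred == "undecided":   # all weights fell in (0.45, 0.55)
            counts["not decided"] += 1
        elif pred == goal_of(case):
            counts["true"] += 1
        else:
            counts["false"] += 1
    return counts

# Toy run with an invented classifier that always answers "12c":
counts = evaluate(
    [{"goal": "12c"}, {"goal": "12b"}, {"goal": "12c"}],
    predict=lambda case: "12c",
    goal_of=lambda case: case["goal"])
# counts == {'true': 2, 'false': 1, 'not decided': 0, 'not predicted': 0}
```

With a default rule in the knowledge base, `predict` never returns None, which is why the 'not predicted' row of Table IV is zero.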


TABLE III
Generated rules.

            Frequencies
No.   Left   Right   Both   Weight    Implication
1     31     8       8      0.5054    0 ⇒ 12b
2     31     21      21     0.7849    0 ⇒ 12c
3     31     1       1      0.0645    0 ⇒ 12d
4     31     1       1      0.0645    0 ⇒ 12e
5     7      8       7      0.9320    2c1a8c ⇒ 12b
6     6      8       6      0.9215    3b2c1a ⇒ 12b
7     6      8       6      0.9215    9e2c1a ⇒ 12b
8     5      8       5      0.9073    11b2c1a ⇒ 12b
9     5      8       5      0.9073    9e2c5b ⇒ 12b
10    5      8       5      0.9073    7e5b1a8c ⇒ 12b
11    5      8       5      0.9073    9e3b1a ⇒ 12b
12    5      8       5      0.9073    9e3b5b ⇒ 12b
13    5      8       5      0.9073    7e3b5b1a ⇒ 12b
14    1      1       1      0.9667    1e ⇒ 12e
15    1      1       1      0.9667    11a7d ⇒ 12d
16    1      1       1      0.9667    5a7d ⇒ 12d
17    1      1       1      0.9667    7e10a ⇒ 12e
18    1      1       1      0.9667    7e5a ⇒ 12e
19    1      1       1      0.9667    5a11a7d ⇒ 12d
20    1      1       1      0.9667    7d1a10a ⇒ 12d

TABLE IV
Results of knowledge base testing on training data.

                  Total            From which         Total    From which
Goal              abs.    rel.     true    false               true     false
12b               7       23%      7       0          100%     100%     0%
12c               22      71%      21      1          100%     95%      5%
12d               1       3%       1       0          100%     100%     0%
12e               1       3%       1       0          100%     100%     0%
Total             31      100%     30      1          100%     97%      3%
Not decided       0       0%
Not predicted     0       0%
Total             31      100%     30      1          100%     97%      3%


As we can see from Table IV, the system made only one mistake when classifying the training data. However, this gives little information as to how the knowledge base obtained will perform on unseen cases; the total number of objects in the training set is too low to consider the knowledge base reliable. The knowledge base can be used to predict the occurrence of a goal combination for new cases. This prediction can be performed using the inference mechanism built into KEX or using another (more comfortable) expert system. When visually interpreting the knowledge base, some 'obvious' piece of knowledge sometimes cannot be found. This is because the effect of the corresponding 'missing' rule can be composed from its (more general) subrules, which are already in the knowledge base; such a rule is redundant and thus not inserted. Therefore, the knowledge base has to be considered as a whole.
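The redundancy argument above can be made concrete with a small numerical sketch. The weights below are invented for illustration; the combining function is PROSPECTOR's, as given in Section 3.

```python
def compose(x, y):
    """PROSPECTOR's combining function x (+) y."""
    return x * y / (x * y + (1 - x) * (1 - y))

# Hypothetical weights: the empty rule 0 => G and a subrule A => G.
w_default = 0.68
w_sub = 0.75
composed = compose(w_default, w_sub)   # effect of both rules firing

# If the observed validity of the longer implication A,B => G does not
# differ significantly from `composed`, KEX stores no separate rule:
# the knowledge base already reproduces its effect, so a reader
# scanning the rules will not find the 'obvious' A,B => G.
```

This is why individual rules in Table III cannot be read in isolation: each stored weight is a correction relative to what its subrules already compose to.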

6. Conclusion

In this contribution we are primarily concerned with the problem of finding a minimal set of rules (knowledge items) which can be successfully used to construct a reasonable knowledge base from data. The case study described in Section 5 gives a good argument for using expert system tools, together with the methodology of automatic knowledge acquisition as implemented in KEX, for advanced processing of environmental data. For more realistic applications, it will be necessary to standardize the different methods of data collection in such a way that larger training and testing data sets become easily accessible to KEX. Of primary concern is the question of categorization of the raw data, which in turn depends on background knowledge of the problem domain.

References

Berka, P.: 1991, 'Expert System SAK and its Applications', in: Selected Reports of the Department of Information and Knowledge Engineering, VSE, Praha, pp. 18-28.
Berka, P.: 1993, 'Knowledge EXplorer. A Tool for Automated Knowledge Acquisition from Data', Technical Report TR-93-03, Austrian Research Institute for AI, Vienna, 23 pp.
Barr, A. and Feigenbaum, E. A.: 1981, The Handbook of Artificial Intelligence, Vol. I, II, William Kaufman Inc., Los Altos, U.S.A.
Duda, R. O. and Gaschnig, J. G.: 1979, 'Model Design in the Prospector Consultant System for Mineral Exploration', in: Michie, D. (ed), Expert Systems in the Micro Electronic Age, Edinburgh University Press, UK.
Hájek, P. and Havránek, T.: 1978, Mechanizing Hypothesis Formation - Mathematical Foundations for a General Theory, Springer-Verlag, Berlin.
Ivánek, J. and Stejskal, B.: 1988, 'Automatic Acquisition of Knowledge Base from Data without Expert: ESOD (Expert System from Observational Data)', in: Proc. COMPSTAT '88, Physica-Verlag, Heidelberg, pp. 175-180.
Ivánek, J., Švenda, J. and Ferjenčík, J.: 1989, 'Inference in Expert Systems Based on Complete Multivalued Logic', in: Proc. of the Workshop on Uncertainty Processing in Expert Systems, Suppl. of Kybernetika 25, 25-32.
Report on the Environment in the Czech Republic, Yearbook, Czech Ecological Institute, 1991.


van Melle, W. et al.: 1981, The EMYCIN Manual, Report No. STAN-CS-81-885, Stanford University, Dept. of Computer Science, October.
Sandahl, K.: 1992, 'Developing Knowledge Management Systems with an Active Expert Methodology', Linköping Studies in Science and Technology, Dissertation No. 277, Linköping University.
Shortliffe, E.: 1976, Computer-based Medical Consultations: MYCIN, American Elsevier.
