ISSN: 2641-3086
Trends in Computer Science and Information Technology
Review Article       Open Access      Peer-Reviewed

An overview of speaker recognition

Junxia Liu1, CL Philip Chen1,2*, Tieshan Li1*, Yi Zuo1 and Peichao He1

1Dalian Maritime University, Dalian 116026, China
2University of Macau, Macau 99999, China
*Corresponding author: CL Philip Chen and Tieshan Li, Navigation Institute, Dalian Maritime University, Linghai Road, Ganjingzi District, Dalian, Liaoning, Room 518, China, Tel: 010-0411-84729255; E-mail: liujunxiadlmu@163.com
Received: 19 July, 2019 | Accepted: 26 August, 2019 | Published: 28 August, 2019
Keywords: Speaker recognition; Feature extraction; MFCC; Deep learning; End-to-end model

Cite this as

Liu J, Philip Chen CL, Li T, Zuo Y, He P (2019) An overview of speaker recognition. Trends Comput Sci Inf Technol 4(1): 001-012. DOI: 10.17352/tcsit.000009

Speaker recognition has been studied for many years and remains a hot topic. This paper presents an overview of speaker recognition methods, covering both classical and state-of-the-art approaches. Following the modular structure of a speaker recognition system, we first introduce its fundamentals, which divide into two main parts: feature extraction and speaker modeling. The speech features most commonly used in speaker recognition are elaborated first; in particular, recent progress in deep neural networks offers a new approach to feature extraction and has become the technology trend. Second, the classical speaker modeling approaches are introduced, followed by recent progress in deep-learning-based speaker recognition. The paper provides an in-depth analysis of the end-to-end model, which consists of a training component to extract features, an enrollment component to train the speaker model, and an evaluation component with an appropriate loss function for optimization. The final part concludes the paper with a discussion of future trends.

Introduction

Speaker recognition has become one of the most popular methods in the biometric identification field, because the voice is the most common signal and the simplest to acquire [1,2]. With the wide application of artificial intelligence machines, researchers have found that voice is among the most natural ways for humans and machines to communicate. Speaker recognition has been applied extensively in access control, telephone transaction authorization, and speaker diarization [3,4]. In general, speaker recognition systems fall into two categories: speaker identification (SI) and speaker verification (SV). Speaker identification is the process of determining who is talking from a group of people; the system must perform a 1:N classification. Speaker verification is the task of determining whether a person is who he/she claims to be (a yes/no decision). Speaker identification can further be divided into "closed-set" and "open-set": when the test voice is assumed to come from a fixed set of known speakers, the task is referred to as closed-set identification; when the test speaker may be unknown to the system, the task is referred to as open-set identification [5-7].

The speech used for speaker recognition can be grouped into text-dependent (TD) and text-independent (TI). In a text-dependent application, the recognition system has prior knowledge of the text to be spoken, and the utterance is expected to follow that text. Because of this prior knowledge, text-dependent recognition can greatly improve system performance. In a text-independent application, there is no fixed text for the speaker to pronounce. Since no prior knowledge is available, text-independent speaker recognition is more difficult, but also more flexible. As speech recognition accuracy has improved and combined speaker and speech recognition systems have developed rapidly, the distinction between text-independent and text-dependent applications has narrowed [8-13].

Speaker recognition has been researched and developed for over five decades and remains an active area. Its methods span from the original human aural and spectrogram comparisons, to simple template matching, to dynamic time-warping approaches, to more modern statistical pattern recognition approaches, and to the deep learning methods popular in recent years [14-18]. Notably, methods applied to speech recognition have often been adopted for speaker recognition (SR) as well. Research corpora have likewise grown from small, private collections to large, open-source corpora. The field has matured to the degree that commercial applications of SR have been growing steadily since the mid-1980s, and many large companies, such as Google, Baidu, IBM and Microsoft, have set up speech research groups.

This paper is a general overview of speaker recognition technologies, introducing the classic techniques from 1987 until today. Meanwhile, we focus on the recent shift from deep neural network models to end-to-end models. The remainder of this overview is organized as follows: Section 2 introduces the development of speaker recognition. Section 3 introduces the fundamentals of speaker recognition. Sections 4 and 5 elaborate on the feature extraction and speaker modeling processes. Section 6 is devoted to the decision method. The end-to-end model is introduced with emphasis in Section 7. Finally, the conclusion and future research trends of recognition technology are outlined in Section 8.

Overview

The development of speaker recognition can be divided into four stages. The first stage spanned the 1960s and 1970s, when research focused on feature extraction and template matching techniques. In 1962, Kersta at Bell Labs proposed the spectrogram ("voiceprint") method for speaker recognition [19]. In 1969, Luck proposed cepstrum technology [20]. In 1976, Atal et al. proposed the Linear Predictive Cepstral Coefficients (LPCCs), which improved the accuracy of speaker recognition [21]. In terms of modeling, template matching was mainly adopted in the 1960s; in the 1970s, Dynamic Time Warping (DTW) and Vector Quantization (VQ) became the mainstream.
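As a concrete illustration of the template-matching era, the DTW alignment mentioned above can be sketched in a few lines. The Euclidean frame distance and the three-way recursion below are standard textbook choices, not specifics drawn from the cited works.

```python
# A minimal sketch of Dynamic Time Warping (DTW): align two feature
# sequences of possibly different lengths and return the cumulative
# alignment cost. Real 1970s systems applied this to LPCC frames; here
# the frames are generic tuples scored with plain Euclidean distance.
import math

def dtw_distance(seq_a, seq_b):
    """Cumulative alignment cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # local frame distance
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

A test utterance would be matched against each enrolled template, and the speaker with the lowest DTW cost selected.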

The second stage spanned the 1980s and 1990s, when statistical speech models began to be applied to speaker recognition [22-24]. In terms of feature extraction, Davis proposed the Mel-frequency cepstral coefficients (MFCCs) for speaker recognition, which became the mainstream features in the following years [25]. In terms of models, the classical approaches divide into two types. The first type, based on vector quantization and dynamic time warping, is referred to as template-based models. The second type comprises stochastic models based on the Gaussian Mixture Model (GMM) [26] or the Hidden Markov Model (HMM) [27,28]. The majority of state-of-the-art SR systems of this era adopted MFCCs as features and used a GMM for speaker modeling [29-31]; the GMM proved extremely successful in TI speaker recognition.
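To make GMM-based scoring concrete, the sketch below computes the average log-likelihood of a sequence of feature frames under a diagonal-covariance GMM. In a real system the weights, means and variances would be trained by EM on MFCC frames; here they are caller-supplied placeholders.

```python
# Average log-likelihood of feature frames under a diagonal-covariance
# GMM: log p(x) = log sum_k w_k N(x; mu_k, diag(sigma_k^2)).
import numpy as np

def gmm_frame_loglik(frames, weights, means, variances):
    """frames: (T, D); weights: (K,); means, variances: (K, D). Returns (T,)."""
    frames = np.atleast_2d(frames)
    D = frames.shape[1]
    # Squared Mahalanobis distance per frame/component pair: (T, K, D) -> (T, K)
    diff2 = (frames[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    # Log normalisation constant of each diagonal Gaussian: (K,)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    log_comp = log_norm[None, :] - 0.5 * diff2.sum(axis=2) + np.log(weights)[None, :]
    # Numerically stable log-sum-exp over components
    m = log_comp.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))

def avg_loglik(frames, weights, means, variances):
    return float(gmm_frame_loglik(frames, weights, means, variances).mean())
```

In identification, each enrolled speaker's GMM scores the test frames and the highest average log-likelihood wins.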

In the third stage, around 2000, the GMM-based speaker recognition methods proposed by Reynolds became the most commonly used, including the classical Maximum a-Posteriori (MAP) adaptation of universal background model parameters (GMM-UBM) [32] and support vector machine (SVM) classification of GMM supervectors (GMM-SVM) [33].

In the training phase, the MAP adaptation framework provides a way of incorporating prior information by adapting the parameters of a GMM from the UBM. The framework is effective in dealing with the problems posed by sparse training data [34]. The SVM uses a non-linear function to map data onto a higher (possibly infinite) dimensional space and then finds the best hyperplane separating the two classes in this space [34,35]. Since the SVM is inherently a two-class classifier, it fits the accept/reject decision of speaker verification naturally.
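The mean-adaptation step of GMM-UBM can be sketched as follows. This follows the standard relevance-MAP formula (each adapted mean is a data-dependent blend of the posterior-weighted data mean and the UBM mean); the relevance factor of 16 and the assumption that per-frame component posteriors are supplied by the caller are illustrative choices.

```python
# Relevance-MAP adaptation of UBM component means: components that "see"
# much speaker data move toward that data; unseen components keep the
# UBM prior mean.
import numpy as np

def map_adapt_means(frames, post, ubm_means, relevance=16.0):
    """frames: (T, D); post: (T, K) component posteriors; ubm_means: (K, D)."""
    n_k = post.sum(axis=0)                              # soft frame count per component
    # Posterior-weighted mean of the speaker data, E_k(x); guard empty components.
    e_k = (post.T @ frames) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]          # data-dependent mixing weight
    return alpha * e_k + (1.0 - alpha) * ubm_means
```

The adapted means (stacked as a "supervector") are exactly what GMM-SVM feeds into its SVM classifier.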

However, high accuracy can only be achieved under ideal conditions; these systems are appropriate for practical application under matched channel conditions, whereas performance can degrade significantly under mismatched conditions. After 2010, the challenge of compensating for these differences offered an active research focus for the SV field, and some of the most advanced channel compensation schemes include joint factor analysis (JFA) [36], i-vectors [37], and nuisance attribute projection (NAP) [38]. Meanwhile, fusing information from different sources of evidence can further improve system performance.

The fourth stage began around 2010, when deep learning promoted the development of speaker recognition. At this stage, the development of speaker recognition technology has been driven by commercial needs, while deep learning, big data and general-purpose graphics processing units (GPUs) have also advanced the field. Various deep-neural-network-based methods have been proposed for speaker recognition [39]. For frame-level feature extraction, researchers apply deep neural networks to extract bottleneck (BN) features [40], d-vectors [41], j-vectors [42], and x-vectors [43]. At the model level, research focuses on various deep neural networks (DNNs) for acoustic feature modeling. Decisions are made using the distance between the target feature vector and the test feature vector. However, the speakers encountered at test time are often unknown during system training, which poses a major challenge for SR.

Whether in an i-vector system or a DNN-based feature-vector system, three modules are usually involved: a training module that computes the representations of speakers, an enrollment module that estimates the speaker model, and an evaluation module with an appropriate loss function for optimization. A new approach has been proposed in which all these modules are trained jointly. Compared with existing methods, such an end-to-end (E2E) method models utterances directly and performs joint estimation, which results in better and more compact models. Moreover, this approach often yields properly simplified systems that need fewer concepts and heuristics [44].

Fundamentals

A SR system can typically be divided into three parts, as shown in Figure 1. The front-end processes the raw speech to obtain a set of speaker-discriminative features which represent the speaker's characteristics (Section 4). The back-end performs modeling and decision-making: a speaker model is trained using the extracted features (Section 5), and a decision-logic model produces recognition scores by comparing features from different utterances (Section 6). As stated above, the latest end-to-end neural speaker recognition systems combine these two components (front-end and back-end) (Section 7).

The basic principles of SR are shown in Figure 2: the top part depicts the enrollment process, while the bottom part depicts the recognition process. The feature extraction module transforms the raw signal into feature vectors. In the enrollment module, the speaker model is trained using the feature vectors of the tagged speaker. In the recognition module, the feature vectors extracted from the unknown speaker's utterances are compared with the models in the system's database to give a similarity score. The final decision of the SR model is made according to a scoring criterion.
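The comparison-and-decision step just described can be sketched with a simple cosine-similarity scorer over speaker embeddings. The threshold value and the open-set rejection rule below are illustrative assumptions, not prescriptions from the paper.

```python
# Score a test embedding against every enrolled speaker model by cosine
# similarity, then decide: report the best-matching speaker, or reject
# (open-set) when even the best score falls below the threshold.
import math

def cosine_score(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def identify(test_vec, enrolled, threshold=0.7):
    """enrolled: dict name -> model vector. Returns (speaker or None, score)."""
    best = max(enrolled, key=lambda name: cosine_score(test_vec, enrolled[name]))
    score = cosine_score(test_vec, enrolled[best])
    return (best, score) if score >= threshold else (None, score)
```

For verification rather than identification, the same score would simply be compared with the threshold for the single claimed speaker.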

Feature Extraction

Feature extraction transforms the raw speech signal into some type of abstract representation, namely feature vectors, in which the properties of the specific speaker are emphasized. In a speaker recognition system, the features can be grouped into two categories: low-level information and high-level information, both of which convey useful cues to the speaker's identity. Over the last forty years of speaker recognition, short-term, lower-level acoustic information, such as cepstral features, has been the most useful. Many researchers have also investigated the potential benefits of high-level characteristics of speech [45]. In contemporary applications, however, exploiting high-level information requires sufficient training data and very large memory; given the high computational cost, high-level features have seen limited practical adoption [46]. Hence the most advanced SR systems still rely on low-level information. This paper focuses on capturing low-level information through short-term spectral features, which are the simplest yet the most discriminative.

Short-term spectral features

The features most commonly used in speaker verification are the Mel-frequency cepstral coefficients (MFCCs) [47], linear prediction cepstral coefficients (LPCCs) [48], and perceptual linear prediction coefficients (PLPs) [49]. In speaker and speech recognition systems, the most fundamental step is to extract feature vectors, uniformly spaced across time, from the time-domain sampled acoustic waveform. The process is as follows:

1. Pre-emphasis: Pre-emphasis is essentially a high-pass filter applied to the waveform: y(t) = x(t) − 0.97x(t−1), where x(t) is the input speech and y(t) is the output. The purpose of pre-emphasis is to emphasize the higher frequencies and flatten the spectrum of the signal. Pre-emphasis also compensates for the spectral effects of the vocal cords and lips during speech production.

2. Framing: A frame is a collection of N sampling points. The purpose of framing is to divide the time-domain waveform into overlapping fixed-duration segments; typical frame durations range from 20 ms to 30 ms (usually 25 ms). In order to avoid large changes between adjacent frames, consecutive frames overlap, usually by about 1/2 to 1/3 of a frame.

3. Windowing: In order to increase the continuity at the left and right edges of each frame, every frame is multiplied by a window function. Common choices include the Hamming, Hanning, and rectangular windows; the Hamming window is most often used. Suppose the framed signal is S(n), n = 0, 1, ..., N−1, where N is the length of a frame. Each frame is then multiplied by the Hamming window, S′(n) = S(n) × W(n), where

W(n) = 0.54 − 0.46 × cos(2πn / (N−1)),  0 ≤ n ≤ N−1
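The three pre-processing steps above can be sketched together in numpy; the 16 kHz sample rate, 25 ms frame length and 10 ms hop are typical values assumed for illustration, not fixed by the paper.

```python
# Pre-emphasis y(t) = x(t) - 0.97 x(t-1), framing into overlapping
# fixed-duration segments, and Hamming windowing W(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)).
import numpy as np

def preprocess(signal, sample_rate=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    # 1. Pre-emphasis: boost high frequencies, flatten the spectrum.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2. Framing (assumes the signal is at least one frame long).
    frame_len = int(sample_rate * frame_ms / 1000)      # e.g. 400 samples
    hop = int(sample_rate * hop_ms / 1000)              # e.g. 160 samples
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # 3. Windowing: multiply each frame by the Hamming window.
    return frames * np.hamming(frame_len)
```

One second of 16 kHz audio thus yields 98 windowed frames of 400 samples each, ready for the spectral analysis below.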

Mel-frequency cepstral coefficients (MFCCs): MFCCs have been widely used for decades in speech processing to capture speech-specific characteristics. MFCC features are derived as follows:

1) Because the characteristics of the signal are usually difficult to observe in the time domain, each frame of N samples is transformed to the frequency domain. The Fast Fourier Transform (FFT) provides a fast implementation of the Discrete Fourier Transform (DFT). For a frame of N samples, the DFT yields the coefficients X(k) = Σ_{n=0}^{N−1} S′(n) e^{−j2πkn/N}, k = 0, 1, ..., N−1.
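Continuing from the windowed frames, this transform step can be sketched with numpy's real FFT. The 512-point FFT size and the periodogram normalisation by N are common conventions assumed here for illustration.

```python
# DFT magnitude (via FFT) and periodogram power spectrum of windowed
# frames; for real input only the first N/2 + 1 bins are kept.
import numpy as np

def power_spectrum(frames, n_fft=512):
    """frames: (n_frames, frame_len) windowed samples -> (n_frames, n_fft//2 + 1)."""
    spectrum = np.fft.rfft(frames, n=n_fft)   # X(k), k = 0 .. n_fft/2
    return (np.abs(spectrum) ** 2) / n_fft    # periodogram estimate of power
```

In a full MFCC pipeline, this power spectrum would next pass through a Mel filterbank, a logarithm, and a DCT.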

The cavity preparations were scanned using the CEREC Omnicam (Dentsply Sirona, York, USA). They had been submitted as examination material within an obligatory, summative OSPE for students in their 6th study semester (in the winter semester 2013/2014 and summer semester 2014) and graded by the trainers.

These were disto-occlusal preparations for ceramic inlays in premolar teeth. The examinations lay three semesters in the past at the time of the study, so that the study participants (examiners) had no memory either of the grades awarded or of the students whose examination papers these represented.

Assessment Tool

The assessment of the cavities was performed by means of checklists compiled by Schmitt et al. 2016 [1] in support of the study. These incorporated six items (1. preparation edge / outer edges, 2. surface & smoothness / inner edges, 3. width & depth, 4. slide-in direction, 5. outer contact positioning and 6. overall grade). The individual assessments (Table 1) were indicated on a Likert scale from 1 = excellent, 2 = very good, 3 = good, 4 = satisfactory to 5 = unsatisfactory (Table 3, Figures 1-7). After completion of the assessments, the examiners were questioned by means of an evaluation questionnaire containing 33 items on general matters (n = 3) such as age, gender and teaching experience, the application concept of the digital-analytical software (n = 17), individual assessment preferences (n = 3), and the study procedure (n = 10) (Tables 6 and 7). Freely composed commentaries rounded off the evaluation questionnaire.

Procedure

By means of the Wilcoxon matched-pairs test with Bonferroni correction, a case number of n = 60 was determined from the results of a preceding train-the-teacher event at α = 0.0125 and a probability of P(X+X' > 0) = 0.25, in order to guarantee a power of 80% for four trainers.

The cavities were randomly allocated to the two groups (Parts A and B) of the experiment. The randomisation took place by entering coded models into an online randomiser (https://www.random.org).

Video-based Assessment

The video of the digitalised teeth was composed in the so-called analysis mode of the prepCheck software (Dentsply Sirona, York, USA) and recorded with the free programme 'Screencast-O-Matic' (Softonic International, Barcelona, Spain, Version 2.0). The individual videos lasted 122 seconds on average and portrayed six different settings that were selected beforehand in the prepCheck software (Table 3). A projector and a screen, as well as a connection to a laptop, were required for the videos. The room environment for both scenarios (Parts A and B) is represented in Figures 8,9. In Part B, the participants (examiners) agreed on uniform assessment conditions with regard to the magnification aids used (2.7x with light).

For Part A (control group with a prepCheck video), the participants could fill in their assessment questionnaires while the video was played (Image 1). For this, they had a maximum of 120 seconds. The plastic tooth 15 to be assessed, which was built into a simulation model, had an occlusal width of approx. 0.7 x 0.9 cm. On the screen, the tooth measured on average approx. 50 x 70 cm, corresponding to an approx. 75x enlargement. For Part B (study group with prepCheck video + subsequent models), the preparations to be assessed were mounted in models (tooth model, KaVo Dental GmbH, Biberach, Germany) on a table (Figure 2). At every seat, basic dental utensils (a mirror, a probe) were provided, together with a lead pencil and cotton wool buds. The examiners used the model with the corresponding reference number and the filled-in checklist with the corresponding individual assessment from Setting A. They examined the already available individual grade and modified it where necessary. For the assessment, the teeth could be taken out of the models and the preparation edges marked with a lead pencil where necessary; this was meant to assist in more easily recognising undesired bevels of the preparation. Before the end of the assessment time and prior to being passed on to the next examiner, the cavities had to be cleaned with a moist cotton wool bud. A maximum of 120 seconds was allotted for the assessment of each model. In the background, a count-down timer visible to all participants ran above the projector (Figure 2).

The case number calculation took place in co-operation with the Institute of Biostatistics and Mathematical Modelling, Frankfurt am Main. The results were assessed by means of the statistics programmes SAS 9.2 (SAS Institute Inc., Cary, USA; PROC MIXED) and R (Version 2.15, package lme4). Basic data were retrieved, and an analysis of the similarity of the mean values was carried out between the observers (ANOVA for dependent observations, as the same models were used).

Finally, the inter-correlations among the assessments of the four raters were calculated. For the comparison of the assessments in Parts A and B, each of the four observer ratings was determined, and the two parts (A and B) were tested using a t-test for paired samples. In order to determine the overall reliability of both scenarios, the six single-assessment parameters were complemented by a further 'mean' variable. In addition, a test was carried out to determine the difference between the two alpha values for Parts A and B, followed by the reliability test for the 'mean' of the grades of both scenarios.
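As a hedged illustration of the reliability statistic referred to throughout (the study itself used SAS and R, not the snippet below), Cronbach's alpha over an observations-by-raters score matrix can be computed as follows.

```python
# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of
# the total score), over a (n_observations x k_items) rating matrix.
import numpy as np

def cronbach_alpha(scores):
    """scores: (n_observations, k_items) array of ratings; returns alpha."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # per-rater/item variance
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of summed scores
    return k / (k - 1) * (1.0 - item_vars.sum() / total_var)
```

Perfectly agreeing raters yield alpha = 1; alpha falls as their ratings decorrelate.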

The statistical assessment was carried out in co-operation with the Competence Centre for Examinations in Medicine, Baden-Württemberg of the Medical Faculty, Heidelberg.

Results

Collected preparation parameters and inter-rater correlations

The descriptive statistical assessment of the individual assessments, providing the mean value, standard deviation, median, minimum and maximum, as well as the calculation of the reliability, took place both across all criteria combined ('mean') and separately for 'preparation edge/outer edges', 'surface & smoothness/inner edges', 'slide-in direction', 'outer contact positioning', 'width & depth' and 'overall grade' (Table 4). The results can be summarised as follows: the individual criteria and the overall grade were rated on average lower (i.e. better) in the control group than in the study group (prepCheck video + subsequent model), but with one exception these differences showed no statistical significance. For the parameter 'outer contact positioning', the alpha rose significantly from 0.56 (Part A) to 0.74 for Part B. The results of the inter-rater correlations are outlined in Table 4.

Assessment questionnaire

All distributed assessment and evaluation questionnaires were returned filled in; the exclusion rate was 0%. Details of the included study population are given in Table 5. The results of the evaluation can be viewed in Tables 6,7. An excerpt from the freely composed commentaries is given in Table 8.

Discussion

This study establishes evidence to support the reliability of video-based assessments of operative competency in performing cavity preparations in dentistry. To the best of our knowledge, this is the first study to prospectively compare two different settings of video-based assessments of cavity preparation performance using predefined checklists.

The reliability of this study lay on average at α = 0.79 (Part B: study group) and at α = 0.83 (Part A: control group). In the literature, one can find reliability values in the form of Cronbach's α of around 0.5 for examinations using CAD systems [24-26]. The reliability for OSPEs without CAD systems, on the other hand, lies in the range of α = 0.68 to α = 0.87 [1, 10, 27]. The present experimental study is thus more closely aligned with these latter studies. Given the reliability values determined, the setting of Part A could be applied to 'high-stake' examinations. Part B lies only slightly below the value of α = 0.8 and requires an additional assessment step beyond the models. Thus, the purely video-based assessment appears the more suitable. For an OSPE, this would mean that further dental personnel could be saved during the examination and that, by means of videos, four examiners could reach a consensus on the grades in considerably less time after the OSPE. This would, however, require the cavities to be scanned in prior to the assessment. The extent to which the situation of the inlay preparation could also be transferred to other examination material, such as, for example, the placement of fillings, would have to be addressed in further studies. In situations of disagreement about an assessment, digital models (scans) could be useful, particularly where the assessed working steps have to be 'hidden' in the course of an examination. This occurs, for example, when during the placement of a restoration the preceding preparation is 'hidden' beneath an under-filling or a completed filling.

On average, participants assessed Part A ('control group' with prepCheck video) with lower values (i.e. better grades) than Part B ('study group' with prepCheck video + subsequent model). The overall grade of the control group ended up 0.39 lower than in the study group. Interestingly, the results of the study group deviated by merely 0.07 grade points from the overall grade in the real, live-performed OSPE. This setting thus appeared to depict the examination situation most closely, which can be explained by the fact that the assessment of the live OSPE is equally performed with the aid of models, so that procedural uniformity applies here.

In studies on video-based examinations, some reliability data are provided in the form of ICC (intraclass correlation coefficient) values. Laeeq, Chen and Scaffidi report an ICC of 0.62 [12, 14, 15], and Kateeb reports ICC values of 0.47 ≤ r ≤ 0.78 [28]. In publications on CAD systems, inter-rater correlations of 0.17 ≤ r ≤ 0.56 are mentioned by Esser [29], and Urbankova determined ICC values of 0.69 ≤ r ≤ 0.90 [30]. The present experimental study is therefore most closely aligned with Esser [29], Kateeb [28], and Laeeq and Chen [14, 15]. The statement by Sampaio-Fernandes that there is a lot of deviation between individual examiners [31] applies in any case, and also in this study, even though attempts were made to counteract this problem through the train-the-teacher events. The effect of the training was less than optimal, however, so that more information and practice would have been required, above all concerning the parameters 'slide-in direction' and 'outer contact positioning'. The fact that the outer contact positioning showed low ICC values within the control group, i.e. when assessed exclusively on the basis of the prepCheck videos, is not surprising: in inlay preparations, the outer contact positioning is difficult to capture, owing to the given extension surfaces and hence the more difficult conditions for scanning the cavities. These areas would certainly be easier to demonstrate in full-crown preparations. When additional models were provided for the assessment, the ICC values doubled, as the scanning no longer played a role and the outer contact positioning could be assessed, i.e. the assessment corrected, better. Here, the software would have to be improved on the part of the manufacturer. In addition, a significant increase of Cronbach's alpha occurred in Setting B when the 'outer contact positioning' was evaluated.
This is also not surprising, as one was in a position to assess these areas more carefully on the model. Fittingly, the study participants rated the possibility of assessing the approximal outer contact positioning via prepCheck at a mean of 4.12 ± 0.54. The assessment of the form of the cavity edge and of the slide-in direction, on the other hand, appear to be clear strengths of the analysis software. The study participants identified a further advantage of the examined analysis tool in the calibration of their colleagues and rated each of its applications with a mean of 1.87 ± 0.95. It is nevertheless generally regarded as fundamentally important to primarily perform the assessment for examinations by use of the analysis tool (3.00 ± 1.41). It is not surprising that it was unanimously agreed that "dental assistants cannot be replaced by prepCheck when assessing cavities" (1.00 ± 0.00), for the sole use of the digital analysis tool in its currently valid version may depict critical grading parameters, such as the outer contact positioning, insufficiently. The overall assessment of the prepCheck analysis tool ended up rather modest at 2.87 ± 0.89 (on a Likert scale of 1 = excellent to 6 = unsatisfactory) and points to the above-mentioned problem areas that could certainly be optimised on the part of the software.

In order to reduce the limitations of the study, various points were considered. First of all, the order of the displayed videos and models was randomised by means of an online randomiser; as the variable of the experimental parts, i.e. the examination teeth, was independent of the participants, a 'selection effect' did not take place. Secondly, the study took place with the same four study participants in both parts, at the same time of day (13:20), within the same time frame (approx. two hours and 27 minutes), using the same procedure in the same rooms. Thirdly, the lighting of both settings was the same, as were the duration of the videos (2 mins 0-10 secs) and the sequence of the settings portrayed in the individual films. Furthermore, the participants were selected from the trainers of the department of operative dentistry who were already actively involved in the practical preparation exercises (phantom course of conservative dentistry) of the sixth study semester and thus had assessment experience. To reduce the problem of a lack of realistic representation, the study was performed in such a way that it reflected the examination circumstances as closely as possible: the duration of the live assessment of a cavity preparation was determined in preliminary studies, and the assessment questionnaires were aligned with the checklists familiar from the examinations [1, 27]. To address the problem of generalisation arising from differing teaching experience, the assessment of the cavities was calibrated in the preceding train-the-teacher events. Despite this, the following limitation should be taken into consideration: it is conceivable that when assessing a model (Part B), the evaluation was generally stricter, as the preliminary grades from the first part were already known.
It is also possible that, over the course of the experimental parts, a practice effect took place that affected each individual assessor to a different degree. This could explain why, despite the preceding train-the-teacher events, the inter-rater reliability differed. The influence of gender, age and teaching experience of the subject group was not a main focus of this examination, although it could well be addressed in future studies.
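The inter-rater reliability figures reported here (and in the conclusions below) are Cronbach's alpha values, which follow the standard formula α = k/(k−1) · (1 − Σσ²ᵢ / σ²ₜ) with each rater treated as an “item”. A minimal sketch of how such values can be computed from a subject-by-rater score matrix (the function name and the sample grades are illustrative, not the study's actual data):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_subjects x n_raters) score matrix.

    Each rater is treated as an 'item':
    alpha = k/(k-1) * (1 - sum(per-rater variances) / variance of row totals)
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of raters
    rater_vars = scores.var(axis=0, ddof=1)     # variance of each rater's grades
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of per-subject totals
    return k / (k - 1) * (1.0 - rater_vars.sum() / total_var)

# Hypothetical grades (1-6 scale) from two raters for four cavities:
grades = [[1, 2], [2, 1], [3, 3], [4, 4]]
print(round(cronbach_alpha(grades), 3))  # → 0.889
```

Values of α ≥ 0.6 are the threshold cited for practical examinations, and α ≥ 0.8 for ‘high-stakes’ examinations; perfectly consistent raters yield α = 1.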

Conclusion

1. This examination shows an average reliability of α = 0.833 in the control group assessment mode (Part A), which exceeds the requirements for practical examinations (α ≥ 0.6) and also meets the general requirements for ‘high-stakes’ examinations of α ≥ 0.8. In Part B, a reliability of α = 0.797 was determined, which does not differ substantially from the control group.

2. The overall assessment did not differ significantly between the two examination groups (Parts A and B). For the ‘outer contact positioning’ parameter, however, significant differences between A and B were determined.

3. The ICC values, with a mean of 0.43 < r < 0.74 for the control group assessment mode (Part A), are higher than in the study group assessment mode (Part B) with 0.35 < r < 0.60. The ICC values of the ‘slide-in direction’ and ‘outer contact positioning’ criteria in the control group assessment mode (Part A) are minimal. The maximum reliability of the criteria ‘preparation edge’, ‘surface’, ‘width & depth’ and ‘outer contact positioning’ in the control group assessment mode (prepCheck video) is acceptable at α > 0.7.

4. The study participants’ assessment of the application concept of the digital-analytic software and of the study procedure was generally positive.
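The ICC values in point 3 can likewise be computed from the raw score matrices. The study does not specify which ICC form was used; as one common variant, a minimal sketch of the one-way random-effects ICC(1,1), which derives the coefficient from the between-subject and within-subject mean squares of a one-way ANOVA (the sample grades are illustrative):

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for an (n_subjects x n_raters) matrix.

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB and MSW are the
    between-subject and within-subject mean squares from a one-way ANOVA.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    subject_means = x.mean(axis=1)
    grand_mean = x.mean()
    msb = k * ((subject_means - grand_mean) ** 2).sum() / (n - 1)
    msw = ((x - subject_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical grades from two raters for four cavities:
print(round(icc_oneway([[1, 2], [2, 1], [3, 3], [4, 4]]), 3))  # → 0.846
```

As with alpha, identical ratings across raters give an ICC of 1, while pure rater noise drives the value towards (or below) zero.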

  1. Schmitt L, Möltner A, Rüttermann S, Gerhardt-Szep S (2016) Study on the Interrater Reliability of an OSPE (Objective Structured Practical Examination) – Subject to the Evaluation Mode in the Phantom Course of Operative Dentistry 33: 61. Link: https://tinyurl.com/y7wtonlj  
  2. Natkin E, Guild RE (1967) Evaluation of preclinical laboratory performance: a systematic study 31: 152-161.
  3. Jasinevicius TR, Landers M, Nelson S, Urbankova A (2004) An evaluation of two dental simulation systems: virtual reality versus contemporary non-computer-assisted 68: 1151-1162. Link: https://tinyurl.com/ybl4p2yr
  4. Kournetas N, Jaeger B, Axmann D, Groten M, Lachmann S, et al. (2004) Assessing the reliability of a digital preparation assistant system used in dental education 68: 1228-1234. Link: https://tinyurl.com/y9run9ts
  5. Cardoso JA, Barbosa C, Fernandes S, Silva CL, Pinho A (2006) Reducing subjectivity in the evaluation of pre-clinical dental preparations for fixed prosthodontics using the Kavo PrepAssistant 10: 149-156. Link: https://tinyurl.com/y7t2w4xh
  6. Kenneth AE (2004) E-learning—new technologies but slow progress 7: 115-117.
  7. Kikuchi H, Ikeda M, Araki K (2013) Evaluation of a virtual reality simulation system for porcelain fused to metal crown preparation at Tokyo Medical and Dental University. 77: 782-792. Link: https://tinyurl.com/y834vzrx
  8. Taylor CL, Grey NJ, Satterthwaite JD (2013) A comparison of grades awarded by peer assessment, faculty and a digital scanning device in a pre-clinical operative skills course 17: 16-21. Link: https://tinyurl.com/y9wwcco2
  9. Turnbull J, Gray J, MacFadyen (1998) Improving in-training evaluation programs. J Gen Intern Med 3: 317-323. Link: https://tinyurl.com/ya3tfd5g
  10. Gerhardt-Szep S, Güntsch A, Pospiech P, Söhnel A, Scheutzel P, et al. (2016) Assessment formats in dental medicine: An overview. GMS J Med Educ 33: Doc65. Link:  https://tinyurl.com/yd5lb8v2
  11. Zia A, Sharma Y, Bettadapura V (2016) Automated video-based assessment of surgical skills for training and evaluation in medical schools. Int J CARS 11: 1623-1636. Link: https://tinyurl.com/yddz5rfx
  12. Scaffidi MA, Grover SC, Carnahan H, Yu JJ, Yong E, et al. (2017) A prospective comparison of live and video-based assessments of colonoscopy performance. Link: https://tinyurl.com/yc5gjsoj
  13. Macluskey M, Durham J, Balmer C, Bell A, Cowpe J, et al. (2011) Dental student suturing skills: a multicentre trial of a checklist-based assessment. Eur J Dent Educ 15: 224-229. Link: https://tinyurl.com/y8ta8dg7
  14. Chen AC, Lee MS, Chen WJ, Lee ST (2013) Assessment in orthopedic training-an analysis of rating consistency by using an objective structured examination video. J Surg Educ 70: 189-192. Link: https://tinyurl.com/ybykh3rc
  15. Laeeq K, Infusino S, Lin SY, Reh DD, Ishii M, et al. (2010) Video-based assessment of operative competency in endoscopic sinus surgery. Am J Rhinol Allergy 24: 234-237. Link: https://tinyurl.com/yauyald6
  16. Podsakoff NP, Podsakoff PM, Mackenzie SB, Klinger RL (2013) Are we really measuring what we say we're measuring? Using video techniques to supplement traditional construct validation procedures. J Appl Psychol. Link: https://tinyurl.com/ybk8xv7e
  17. Sarkiss CA, Philemond S, Lee J, Sobotka S, Holloway TD, et al. (2016) Neurosurgical Skills Assessment: Measuring Technical Proficiency in Neurosurgery Residents Through Intraoperative Video Evaluations. World Neurosurg 89: 1-8. Link: https://tinyurl.com/y8znzezo
  18. Perron NJ, Louis-Simonet M, Cerutti B, Pfarrwaller E, Sommer J, et al. (2016) Feedback in formative OSCEs: comparison between direct observation and video-based formats. Med Educ Online 21: 32160. Link: https://tinyurl.com/yb7d9a45
  19. Takazawa S, Ishimaru T, Harada K (2015) Video-Based Skill Assessment of Endoscopic Suturing in a Pediatric Chest Model and a Box Trainer. J Laparoendosc Adv Surg Tech A 25: 445-453. Link: https://tinyurl.com/y6uwhuy7
  20. Massey D, Byrne J, Higgins N, Weeks B, Shuker MA, et al. (2017) Enhancing OSCE preparedness with video exemplars in undergraduate nursing students. A mixed method study. Nurse Educ Today: 54: 56-61. Link: https://tinyurl.com/y7v9jcgt
  21. Deie K, Ishimaru T, Takazawa S, Harada K, Sugita N, et al. (2017) Preliminary Study of Video-Based Pediatric Endoscopic Surgical Skill Assessment Using a Neonatal Esophageal Atresia/Tracheoesophageal Fistula Model. J Laparoendosc Adv Surg Tech A 27: 76-81. Link: https://tinyurl.com/y8skg2b6
  22. Simpson D, Helm R, Drewniak T, Ziebert MM, Brown D, et al. (2006) Objective Structured Video Examinations (OSVEs) for Geriatrics Education. Gerontol Geriatr Educ 26: 7-24. Link: https://tinyurl.com/ycul69kp
  23. Pérez-Escamirosa F, Chousleb-Kalach A, del Carmen Hernández-Baro M, Sánchez-Margallo JA, Lorias-Espinoza D, et al. (2016) Construct validity of a video-tracking system based on orthogonal cameras approach for objective assessment of laparoscopic skills. Int J CARS 11: 2283-2293. Link: https://tinyurl.com/yb3wz2vq
  24. Weigl P, Felber R, Brandt J, König E, Lauer HC (2015) Fully automated and objective quality inspection of a clinical tooth preparation. Proceedings 40th Annual Meeting of the Association for Dental Education in Europe (ADEE). Eur J Dent Educ 19: e8–e35.  
  25. Stumpf A, Weigl P, Gerhardt T, Felber R, Heidemann D, Gerhardt-Szép (2015) Computer-aided 3D analysis of cavities via prepCheck - a pilot study. Proceedings 40th Annual Meeting of the Association for Dental Education in Europe (ADEE). Eur J Dent Educ 19: e8–e35. Link: https://tinyurl.com/yabr9pub
  26. Roopa VA (2017) The Calibration of a Software Programme to Assess Ceramic Crown Preparations in a Pre-clinical Setting. Link: https://tinyurl.com/y9t7n2un
  27. Petkov P, Knuth-Herzig K, Hoefer S, Stehle S, Scherer S, et al. (2017) The reliability and predictive validity of a sixth-semester OSPE in conservative dentistry regarding performance on the state examination. GMS J Med Educ 34. Link: https://tinyurl.com/y9gu4jqj
  28. Kateeb ET, Kamal MS, Kadamani AM, Abu Hantash RO, Arqoub MM (2016) Utilising an innovative digital software to grade pre-clinical crown preparation exercise. Eur J Dent Educ. Link: https://tinyurl.com/ydz8oc3d
  29. Esser C, Kerschbaum T, Winkelmann V, Krage T, Faber FJ (2006) A comparison of the visual and technical assessment of preparations made by dental students. Eur J Dent Educ 10: 157-161. Link: https://tinyurl.com/y77ecmm7
  30. Urbankova A (2010) Impact of computerized dental simulation training on preclinical operative dentistry examination scores. J Dent Educ 74: 402-409. Link: https://tinyurl.com/y7gfqazv
  31. Sampaio-Fernandes MA, Sampaio-Fernandes MM, Fonseca PA, Almeida PR, Reis-Campos JC, et al. (2015) Evaluation of occlusal rest seats with 3D technology in dental education. J Dent Educ 79: 166-176. Link: https://tinyurl.com/ycvykecy
© 2018 Wälter A, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
 
