ISSN: 2641-3086
Trends in Computer Science and Information Technology
Review Article       Open Access      Peer-Reviewed

An overview of speaker recognition

Junxia Liu1, CL Philip Chen1,2*, Tieshan Li1*, Yi Zuo1 and Peichao He1

1Dalian Maritime University, Dalian 116026, China
2University of Macau, Macau 99999, China
*Corresponding author: CL Philip Chen and Tieshan Li, Navigation Institute, Dalian Maritime University, Linghai Road, Ganjingzi District, Dalian, Liaoning, Room 518, China, Tel: 010-0411-84729255; E-mail: liujunxiadlmu@163.com
Received: 19 July, 2019 | Accepted: 26 August, 2019 | Published: 28 August, 2019
Keywords: Speaker recognition; Feature extraction; MFCC; Deep learning; End-to-end model

Cite this as

Liu J, Philip Chen CL, Li T, Zuo Y, He P (2019) An overview of speaker recognition. Trends Comput Sci Inf Technol 4(1): 001-012. DOI: 10.17352/tcsit.000009

Speaker recognition has been studied for many years and remains a hot topic. This paper presents an overview of speaker recognition methods, covering both classical and state-of-the-art approaches. Following the modular structure of a speaker recognition system, we first introduce its fundamentals, which divide into two main parts: feature extraction and speaker modeling. The speech features most commonly used in speaker recognition are elaborated first; in particular, recent progress in deep neural networks offers a new approach to feature extraction and has become the technology trend. Second, the classical speaker modeling approaches are introduced, followed by recent progress in deep-learning-based speaker recognition. The paper provides an in-depth analysis of the end-to-end model, which consists of a training component to extract features, an enrollment component to train the speaker model, and an evaluation component with an appropriate loss function for optimization. The final part concludes the paper with a discussion of future trends.

Introduction

Speaker recognition has become one of the most popular methods in the biometric identification field, because the voice is the most common signal and the simplest to acquire [1,2]. With the wide application of artificial intelligence machines, researchers have found that voice is among the most natural ways for humans and machines to communicate. Speaker recognition has been applied extensively in access control, telephone transaction authorization, and speaker diarization [3,4]. In general, speaker recognition systems fall into two categories: speaker identification (SI) and speaker verification (SV). Speaker identification is the process of determining who is talking from a group of people; the system must perform a 1:N classification. Speaker verification is the task of determining whether a person is who he/she claims to be (a yes/no decision). Speaker identification can further be divided into "closed-set" and "open-set": when the test voice is assumed to come from a fixed set of known speakers, the task is referred to as closed-set identification; when the test speaker may be unknown to the system, the task is referred to as open-set identification [5-7].

The speech used for speaker recognition can be grouped into text-dependent (TD) and text-independent (TI). In a text-dependent application, the recognition system has prior knowledge of the text to be spoken, and the utterance is expected to follow that text. Because of this prior knowledge, text-dependent recognition can greatly improve system performance. In a text-independent application, there is no fixed text for the speaker to pronounce. Since no prior knowledge is available, text-independent speaker recognition is more difficult, but also more flexible. As speech recognition accuracy has improved and combined speaker and speech recognition systems have developed rapidly, the distinction between text-independent and text-dependent applications has narrowed [8-13].

Speaker recognition has been researched and developed for over five decades and remains an active area. Its methods span from the original human aural and spectrogram comparisons, to simple template matching, to dynamic time-warping approaches, to more modern statistical pattern recognition approaches, and to the deep learning methods popular in recent years [14-18]. Notably, methods applied to speech recognition have often been adopted for speaker recognition (SR) as well. Research corpora have likewise grown from small, private collections to large, open-source corpora. The field has matured to the degree that commercial applications of SR have been growing steadily since the mid-1980s, and many large companies, such as Google, Baidu, IBM and Microsoft, have set up speech research groups.

This paper is a general overview of speaker recognition technologies, introducing the classic techniques from 1987 until today. Meanwhile, we focus on the recent shift from deep neural network models to end-to-end models. The remainder of this overview is organized as follows: Section 2 introduces the development of speaker recognition. Section 3 introduces the fundamentals of speaker recognition. Sections 4 and 5 elaborate on the feature extraction and speaker modeling processes. Section 6 is devoted to the decision method. The end-to-end model is introduced with emphasis in Section 7. Finally, the conclusion and future research trends of recognition technology are outlined in Section 8.

Overview

The development of speaker recognition can be divided into four stages. The first stage spanned the 1960s and 1970s, when research focused on feature extraction and template matching techniques. In 1962, Kersta at Bell Labs proposed the spectrogram ("voiceprint") method for speaker recognition [19]. In 1969, Luck proposed cepstrum technology [20]. In 1976, Atal et al. proposed the Linear Predictive Cepstral Coefficients (LPCCs), which improved the accuracy of speaker recognition [21]. In terms of modeling, template matching was mainly adopted in the 1960s; in the 1970s, Dynamic Time Warping (DTW) and Vector Quantization (VQ) became the mainstream.
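As a concrete illustration of the template-matching era, the DTW alignment mentioned above can be sketched in a few lines. The Euclidean frame distance and the three-way recursion below are standard textbook choices, not specifics drawn from the cited works.

```python
# A minimal sketch of Dynamic Time Warping (DTW): align two feature
# sequences of possibly different lengths and return the cumulative
# alignment cost. Real 1970s systems applied this to LPCC frames; here
# the frames are generic tuples scored with plain Euclidean distance.
import math

def dtw_distance(seq_a, seq_b):
    """Cumulative alignment cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # local frame distance
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

A test utterance would be matched against each enrolled template, and the speaker with the lowest DTW cost selected.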

The second stage spanned the 1980s and 1990s, when statistical speech models began to be applied to speaker recognition [22-24]. In terms of feature extraction, Davis proposed the Mel-frequency cepstral coefficients (MFCCs) for speaker recognition, which became the mainstream features in the following years [25]. In terms of models, the classical approaches divide into two types. The first type, based on vector quantization and dynamic time warping, is referred to as template-based models. The second type comprises stochastic models based on the Gaussian Mixture Model (GMM) [26] or the Hidden Markov Model (HMM) [27,28]. The majority of state-of-the-art SR systems of this era adopted MFCCs as features and used a GMM for speaker modeling [29-31]; the GMM proved extremely successful in TI speaker recognition.
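To make GMM-based scoring concrete, the sketch below computes the average log-likelihood of a sequence of feature frames under a diagonal-covariance GMM. In a real system the weights, means and variances would be trained by EM on MFCC frames; here they are caller-supplied placeholders.

```python
# Average log-likelihood of feature frames under a diagonal-covariance
# GMM: log p(x) = log sum_k w_k N(x; mu_k, diag(sigma_k^2)).
import numpy as np

def gmm_frame_loglik(frames, weights, means, variances):
    """frames: (T, D); weights: (K,); means, variances: (K, D). Returns (T,)."""
    frames = np.atleast_2d(frames)
    D = frames.shape[1]
    # Squared Mahalanobis distance per frame/component pair: (T, K, D) -> (T, K)
    diff2 = (frames[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    # Log normalisation constant of each diagonal Gaussian: (K,)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    log_comp = log_norm[None, :] - 0.5 * diff2.sum(axis=2) + np.log(weights)[None, :]
    # Numerically stable log-sum-exp over components
    m = log_comp.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))

def avg_loglik(frames, weights, means, variances):
    return float(gmm_frame_loglik(frames, weights, means, variances).mean())
```

In identification, each enrolled speaker's GMM scores the test frames and the highest average log-likelihood wins.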

In the third stage, around 2000, the GMM-based speaker recognition methods proposed by Reynolds became the most commonly used, including the classical Maximum a-Posteriori (MAP) adaptation of universal background model parameters (GMM-UBM) [32] and support vector machine (SVM) classification of GMM supervectors (GMM-SVM) [33].

In the training phase, the MAP adaptation framework provides a way of incorporating prior information by adapting the parameters of a GMM from the UBM. The framework is effective in dealing with the problems posed by sparse training data [34]. The SVM uses a non-linear function to map data onto a higher (possibly infinite) dimensional space and then finds the best hyperplane separating the two classes in this space [34,35]. Since the SVM is inherently a two-class classifier, it fits the accept/reject decision of speaker verification naturally.
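The mean-adaptation step of GMM-UBM can be sketched as follows. This follows the standard relevance-MAP formula (each adapted mean is a data-dependent blend of the posterior-weighted data mean and the UBM mean); the relevance factor of 16 and the assumption that per-frame component posteriors are supplied by the caller are illustrative choices.

```python
# Relevance-MAP adaptation of UBM component means: components that "see"
# much speaker data move toward that data; unseen components keep the
# UBM prior mean.
import numpy as np

def map_adapt_means(frames, post, ubm_means, relevance=16.0):
    """frames: (T, D); post: (T, K) component posteriors; ubm_means: (K, D)."""
    n_k = post.sum(axis=0)                              # soft frame count per component
    # Posterior-weighted mean of the speaker data, E_k(x); guard empty components.
    e_k = (post.T @ frames) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]          # data-dependent mixing weight
    return alpha * e_k + (1.0 - alpha) * ubm_means
```

The adapted means (stacked as a "supervector") are exactly what GMM-SVM feeds into its SVM classifier.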

However, high accuracy can only be achieved under ideal conditions; these systems are appropriate for practical application under matched channel conditions, whereas performance can degrade significantly under mismatched conditions. After 2010, the challenge of compensating for these differences offered an active research focus for the SV field, and some of the most advanced channel compensation schemes include joint factor analysis (JFA) [36], i-vectors [37], and nuisance attribute projection (NAP) [38]. Meanwhile, fusing information from different sources of evidence can further improve system performance.

The fourth stage began around 2010, when deep learning promoted the development of speaker recognition. At this stage, the development of speaker recognition technology has been driven by commercial needs, while deep learning, big data and general-purpose graphics processing units (GPUs) have also advanced the field. Various deep-neural-network-based methods have been proposed for speaker recognition [39]. For frame-level feature extraction, researchers apply deep neural networks to extract bottleneck (BN) features [40], d-vectors [41], j-vectors [42], and x-vectors [43]. At the model level, research focuses on various deep neural networks (DNNs) for acoustic feature modeling. Decisions are made using the distance between the target feature vector and the test feature vector. However, the speakers encountered at test time are often unknown during system training, which poses a major challenge for SR.

Whether in an i-vector system or a DNN-based feature-vector system, three modules are usually involved: a training module that computes the representations of speakers, an enrollment module that estimates the speaker model, and an evaluation module with an appropriate loss function for optimization. A new approach has been proposed in which all these modules are trained jointly. Compared with existing methods, such an end-to-end (E2E) method models utterances directly and performs joint estimation, which results in better and more compact models. Moreover, this approach often yields properly simplified systems that need fewer concepts and heuristics [44].

Fundamentals

A SR system can typically be divided into three parts, as shown in Figure 1. The front-end processes the raw speech to obtain a set of speaker-discriminative features which represent the speaker's characteristics (Section 4). The back-end performs modeling and decision-making: a speaker model is trained using the extracted features (Section 5), and a decision-logic model produces recognition scores by comparing features from different utterances (Section 6). As stated above, the latest end-to-end neural speaker recognition systems combine these two components (front-end and back-end) (Section 7).

The basic principles of SR are shown in Figure 2: the top part depicts the enrollment process, while the bottom part depicts the recognition process. The feature extraction module transforms the raw signal into feature vectors. In the enrollment module, the speaker model is trained using the feature vectors of the tagged speaker. In the recognition module, the feature vectors extracted from the unknown speaker's utterances are compared with the models in the system's database to give a similarity score. The final decision of the SR model is made according to a scoring criterion.
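The comparison-and-decision step just described can be sketched with a simple cosine-similarity scorer over speaker embeddings. The threshold value and the open-set rejection rule below are illustrative assumptions, not prescriptions from the paper.

```python
# Score a test embedding against every enrolled speaker model by cosine
# similarity, then decide: report the best-matching speaker, or reject
# (open-set) when even the best score falls below the threshold.
import math

def cosine_score(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def identify(test_vec, enrolled, threshold=0.7):
    """enrolled: dict name -> model vector. Returns (speaker or None, score)."""
    best = max(enrolled, key=lambda name: cosine_score(test_vec, enrolled[name]))
    score = cosine_score(test_vec, enrolled[best])
    return (best, score) if score >= threshold else (None, score)
```

For verification rather than identification, the same score would simply be compared with the threshold for the single claimed speaker.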

Feature Extraction

Feature extraction transforms the raw speech signal into some type of abstract representation, namely feature vectors, in which the properties of the specific speaker are emphasized. In a speaker recognition system, the features can be grouped into two categories: low-level information and high-level information, both of which convey useful cues to the speaker's identity. Over the last forty years of speaker recognition, short-term, lower-level acoustic information, such as cepstral features, has been the most useful. Many researchers have also investigated the potential benefits of high-level characteristics of speech [45]. In contemporary applications, however, exploiting high-level information requires sufficient training data and very large memory; given the high computational cost, high-level features have seen limited practical adoption [46]. Hence the most advanced SR systems still rely on low-level information. This paper focuses on capturing low-level information through short-term spectral features, which are the simplest yet the most discriminative.

Short-term spectral features

The features most commonly used in speaker verification are the Mel-frequency cepstral coefficients (MFCCs) [47], linear prediction cepstral coefficients (LPCCs) [48], and perceptual linear prediction coefficients (PLPs) [49]. In speaker and speech recognition systems, the most fundamental step is to extract feature vectors, uniformly spaced across time, from the time-domain sampled acoustic waveform. The process is as follows:

1. Pre-emphasis: Pre-emphasis is essentially a high-pass filter applied to the waveform: y(t) = x(t) − 0.97x(t−1), where x(t) is the input speech and y(t) is the output. The purpose of pre-emphasis is to emphasize the higher frequencies and flatten the spectrum of the signal. Pre-emphasis also compensates for the spectral effects of the vocal cords and lips during speech production.

2. Framing: A frame is a collection of N sampling points. The purpose of framing is to divide the time-domain waveform into overlapping fixed-duration segments; typical frame durations range from 20 ms to 30 ms (usually 25 ms). In order to avoid large changes between adjacent frames, consecutive frames overlap, usually by about 1/2 to 1/3 of a frame.

3. Windowing: In order to increase the continuity at the left and right edges of each frame, every frame is multiplied by a window function. Common choices include the Hamming, Hanning, and rectangular windows; the Hamming window is most often used. Suppose the framed signal is S(n), n = 0, 1, ..., N−1, where N is the length of a frame. Each frame is then multiplied by the Hamming window, S′(n) = S(n) × W(n), where

W(n) = 0.54 − 0.46 × cos(2πn / (N−1)),  0 ≤ n ≤ N−1
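The three pre-processing steps above can be sketched together in numpy; the 16 kHz sample rate, 25 ms frame length and 10 ms hop are typical values assumed for illustration, not fixed by the paper.

```python
# Pre-emphasis y(t) = x(t) - 0.97 x(t-1), framing into overlapping
# fixed-duration segments, and Hamming windowing W(n) = 0.54 - 0.46 cos(2*pi*n/(N-1)).
import numpy as np

def preprocess(signal, sample_rate=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    # 1. Pre-emphasis: boost high frequencies, flatten the spectrum.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2. Framing (assumes the signal is at least one frame long).
    frame_len = int(sample_rate * frame_ms / 1000)      # e.g. 400 samples
    hop = int(sample_rate * hop_ms / 1000)              # e.g. 160 samples
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # 3. Windowing: multiply each frame by the Hamming window.
    return frames * np.hamming(frame_len)
```

One second of 16 kHz audio thus yields 98 windowed frames of 400 samples each, ready for the spectral analysis below.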

Mel-frequency cepstral coefficients (MFCCs): MFCCs have been widely used for decades in speech processing to capture speech-specific characteristics. MFCC features are derived as follows:

1) Because the characteristics of the signal are usually difficult to observe in the time domain, each frame of N samples is transformed to the frequency domain. The Fast Fourier Transform (FFT) provides a fast implementation of the Discrete Fourier Transform (DFT). For a frame of N samples, the DFT yields the coefficients X(k) = Σ_{n=0}^{N−1} S′(n) e^{−j2πkn/N}, k = 0, 1, ..., N−1.
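Continuing from the windowed frames, this transform step can be sketched with numpy's real FFT. The 512-point FFT size and the periodogram normalisation by N are common conventions assumed here for illustration.

```python
# DFT magnitude (via FFT) and periodogram power spectrum of windowed
# frames; for real input only the first N/2 + 1 bins are kept.
import numpy as np

def power_spectrum(frames, n_fft=512):
    """frames: (n_frames, frame_len) windowed samples -> (n_frames, n_fft//2 + 1)."""
    spectrum = np.fft.rfft(frames, n=n_fft)   # X(k), k = 0 .. n_fft/2
    return (np.abs(spectrum) ** 2) / n_fft    # periodogram estimate of power
```

In a full MFCC pipeline, this power spectrum would next pass through a Mel filterbank, a logarithm, and a DCT.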

The cavity preparations were scanned using the CEREC Omnicam (Dentsply Sirona, York, USA). They had been submitted as examination material within an obligatory, summative OSPE for students in their 6th study semester (in the winter semester 2013/2014 and summer semester 2014) and graded by the trainers.

These were disto-occlusal preparations for ceramic inlays in premolar teeth. The examinations lay three semesters in the past at the time of the study, so that the study participants (examiners) had no memory either of the grades awarded or of the students whose examination papers these represented.

Assessment Tool

The assessment of the cavities was performed by means of checklists compiled by Schmitt et al. 2016 [1] in support of the study. These incorporated six items (1. preparation edge / outer edges, 2. surface & smoothness / inner edges, 3. width & depth, 4. slide-in direction, 5. outer contact positioning and 6. overall grade). The individual assessments (Table 1) were indicated on a Likert scale from 1 = excellent, 2 = very good, 3 = good, 4 = satisfactory to 5 = unsatisfactory (Table 3, Figures 1-7). After completion of the assessments, the examiners were questioned by means of an evaluation questionnaire containing 33 items on general matters (n = 3) such as age, gender and teaching experience, the application concept of the digital-analytical software (n = 17), individual assessment preferences (n = 3), and the study procedure (n = 10) (Tables 6 and 7). Freely composed commentaries rounded off the evaluation questionnaire.

Procedure

By means of the Wilcoxon matched-pairs test with Bonferroni correction, a case number of n = 60 was determined from the results of a preceding train-the-teacher event at α = 0.0125 and a probability of P(X+X' > 0) = 0.25, in order to guarantee a power of 80% for four trainers.

The cavities were randomly allocated to the two groups (Parts A and B) of the experiment. The randomisation took place by entering coded models into an online randomiser (https://www.random.org).

Video-based Assessment

The video of the digitalised teeth was composed in the so-called analysis mode of the prepCheck software (Dentsply Sirona, York, USA) and recorded with the free programme 'Screencast-O-Matic' (Softonic International, Barcelona, Spain, Version 2.0). The individual videos lasted 122 seconds on average and portrayed six different settings that were selected beforehand in the prepCheck software (Table 3). A projector and a screen, as well as a connection to a laptop, were required for the videos. The room environment for both scenarios (Parts A and B) is represented in Figures 8,9. In Part B, the participants (examiners) agreed on uniform assessment conditions with regard to the magnification aids used (2.7x with light).

For Part A (control group with a prepCheck video), the participants could fill in their assessment questionnaires while the video was played (Image 1). For this, they had a maximum of 120 seconds. The plastic tooth 15 to be assessed, which was built into a simulation model, had an occlusal width of approx. 0.7 x 0.9 cm. On the screen, the tooth measured on average approx. 50 x 70 cm, corresponding to an approx. 75x enlargement. For Part B (study group with prepCheck video + subsequent models), the preparations to be assessed were mounted in models (tooth model, KaVo Dental GmbH, Biberach, Germany) on a table (Figure 2). At every seat, basic dental utensils (a mirror, a probe) were provided, together with a lead pencil and cotton wool buds. The examiners used the model with the corresponding reference number and the filled-in checklist with the corresponding individual assessment from Setting A. They examined the already available individual grade and modified it where necessary. For the assessment, the teeth could be taken out of the models and the preparation edges marked with a lead pencil where necessary; this was meant to assist in more easily recognising undesired bevels of the preparation. Before the end of the assessment time and prior to being passed on to the next examiner, the cavities had to be cleaned with a moist cotton wool bud. A maximum of 120 seconds was allotted for the assessment of each model. In the background, a count-down timer visible to all participants ran above the projector (Figure 2).

The case number calculation took place in co-operation with the Institute of Biostatistics and Mathematical Modelling, Frankfurt am Main. The results were assessed by means of the statistics programmes SAS 9.2 (SAS Institute Inc., Cary, USA; PROC MIXED) and R (Version 2.15, package lme4). Basic data were retrieved, and an analysis of the similarity of the mean values was carried out between the observers (ANOVA for dependent observations, as the same models were used).

Finally, the inter-correlations among the assessments of the four raters were calculated. For the comparison of the assessments in Parts A and B, each of the four observer ratings was determined, and the two parts (A and B) were tested using a t-test for paired samples. In order to determine the overall reliability of both scenarios, the six single-assessment parameters were complemented by a further 'mean' variable. In addition, a test was carried out to determine the difference between the two alpha values for Parts A and B, followed by the reliability test for the 'mean' of the grades of both scenarios.
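As a hedged illustration of the reliability statistic referred to throughout (the study itself used SAS and R, not the snippet below), Cronbach's alpha over an observations-by-raters score matrix can be computed as follows.

```python
# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of
# the total score), over a (n_observations x k_items) rating matrix.
import numpy as np

def cronbach_alpha(scores):
    """scores: (n_observations, k_items) array of ratings; returns alpha."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # per-rater/item variance
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of summed scores
    return k / (k - 1) * (1.0 - item_vars.sum() / total_var)
```

Perfectly agreeing raters yield alpha = 1; alpha falls as their ratings decorrelate.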

The statistical assessment was carried out in co-operation with the Competence Centre for Examinations in Medicine, Baden-Württemberg of the Medical Faculty, Heidelberg.

Results

Collected preparation parameters and inter-rater correlations

The descriptive statistical assessment of the individual assessments, providing the mean value, standard deviation, median, minimum and maximum, as well as the calculation of the reliability, took place both across all criteria combined ('mean') and separately for 'preparation edge/outer edges', 'surface & smoothness/inner edges', 'slide-in direction', 'outer contact positioning', 'width & depth' and 'overall grade' (Table 4). The results can be summarised as follows: the individual criteria and the overall grade were rated on average lower (i.e. better) in the control group than in the study group (prepCheck video + subsequent model), but with one exception these differences showed no statistical significance. For the parameter 'outer contact positioning', the alpha rose significantly from 0.56 (Part A) to 0.74 for Part B. The results of the inter-rater correlations are outlined in Table 4.

Assessment questionnaire

All distributed assessment and evaluation questionnaires were returned filled in; the exclusion rate was 0%. Details of the included study population are given in Table 5. The results of the evaluation can be viewed in Tables 6,7. An excerpt from the freely composed commentaries is given in Table 8.

Discussion

This study establishes evidence to support the reliability of video-based assessments of operative competency in performing cavity preparations in dentistry. To the best of our knowledge, this is the first study to prospectively compare two different settings of video-based assessments of cavity preparation performance using predefined checklists.

The reliability of this study lay on average at α = 0.79 (Part B: study group) and at α = 0.83 (Part A: control group). In the literature, one can find reliability values in the form of Cronbach's α of around 0.5 for examinations using CAD systems [24-26]. The reliability for OSPEs without CAD systems, on the other hand, lies in the range of α = 0.68 to α = 0.87 [1, 10, 27]. The present experimental study is thus more closely aligned with these latter studies. Given the reliability values determined, the setting of Part A could be applied to 'high-stake' examinations. Part B lies only slightly below the value of α = 0.8 and requires an additional assessment step beyond the models. Thus, the purely video-based assessment appears the more suitable. For an OSPE, this would mean that further dental personnel could be saved during the examination and that, by means of videos, four examiners could reach a consensus on the grades in considerably less time after the OSPE. This would, however, require the cavities to be scanned in prior to the assessment. The extent to which the situation of the inlay preparation could also be transferred to other examination material, such as, for example, the placement of fillings, would have to be addressed in further studies. In situations of disagreement about an assessment, digital models (scans) could be useful, particularly where the assessed working steps have to be 'hidden' in the course of an examination. This occurs, for example, when during the placement of a restoration the preceding preparation is 'hidden' beneath an under-filling or a completed filling.

On average, participants assessed Part A ('control group' with prepCheck video) with lower values (i.e. better grades) than Part B ('study group' with prepCheck video + subsequent model). The overall grade of the control group ended up 0.39 lower than in the study group. Interestingly, the results of the study group deviated by merely 0.07 grade points from the overall grade in the real, live-performed OSPE. This setting thus appeared to depict the examination situation most closely, which can be explained by the fact that the assessment of the live OSPE is equally performed with the aid of models, so that procedural uniformity applies here.

In studies on video-based examinations, some reliability data are provided in the form of ICC (intraclass correlation coefficient) values. Laeeq, Chen and Scaffidi report an ICC of 0.62 [12, 14, 15], and Kateeb reports ICC values of 0.47 ≤ r ≤ 0.78 [28]. In publications on CAD systems, inter-rater correlations of 0.17 ≤ r ≤ 0.56 are mentioned by Esser [29], and Urbankova determined ICC values of 0.69 ≤ r ≤ 0.90 [30]. The present experimental study is therefore most closely aligned with Esser [29], Kateeb [28], and Laeeq and Chen [14, 15]. The statement by Sampaio-Fernandes that there is a lot of deviation between individual examiners [31] applies in any case, and also in this study, even though attempts were made to counteract this problem through the train-the-teacher events. The effect of the training was less than optimal, however, so that more information and practice would have been required, above all concerning the parameters 'slide-in direction' and 'outer contact positioning'. The fact that the outer contact positioning showed low ICC values within the control group, i.e. when assessed exclusively on the basis of the prepCheck videos, is not surprising: in inlay preparations, the outer contact positioning is difficult to capture, owing to the given extension surfaces and hence the more difficult conditions for scanning the cavities. These areas would certainly be easier to demonstrate in full-crown preparations. When additional models were provided for the assessment, the ICC values doubled, as the scanning no longer played a role and the outer contact positioning could be assessed, i.e. the assessment corrected, better. Here, the software would have to be improved on the part of the manufacturer. In addition, a significant increase of Cronbach's alpha occurred in Setting B when the 'outer contact positioning' was evaluated.
This is also not surprising, as one was in a position to assess these areas more carefully on the model. Fittingly, the study participants rated the possibility of assessing the approximal outer contact positioning via prepCheck at a mean of 4.12 ± 0.54. The assessment of the form of the cavity edge and of the slide-in direction, on the other hand, appear to be clear strengths of the analysis software. The study participants identified a further advantage of the examined analysis tool in the calibration of their colleagues and rated each of its applications with a mean of 1.87 ± 0.95. It is nevertheless generally regarded as fundamentally important to primarily perform the assessment for examinations by use of the analysis tool (3.00 ± 1.41). It is not surprising that it was unanimously agreed that "dental assistants cannot be replaced by prepCheck when assessing cavities" (1.00 ± 0.00), for the sole use of the digital analysis tool in its currently valid version may depict critical grading parameters, such as the outer contact positioning, insufficiently. The overall assessment of the prepCheck analysis tool ended up rather modest at 2.87 ± 0.89 (on a Likert scale of 1 = excellent to 6 = unsatisfactory) and points to the above-mentioned problem areas that could certainly be optimised on the part of the software.

In order to reduce the limitations of the study, various points were considered. First of all, the order of the displayed videos and models was randomised by means of an online randomiser; as the variable of the experimental parts, i.e. the examination teeth, was independent of the participants, a 'selection effect' did not take place. Secondly, the study took place with the same four study participants in both parts, at the same time of day (13:20), within the same time frame (approx. two hours and 27 minutes), using the same procedure in the same rooms. Thirdly, the lighting of both settings was the same, as were the duration of the videos (2 mins 0-10 secs) and the sequence of the settings portrayed in the individual films. Furthermore, the participants were selected from the trainers of the department of operative dentistry who were already actively involved in the practical preparation exercises (phantom course of conservative dentistry) of the sixth study semester and thus had assessment experience. To reduce the problem of a lack of realistic representation, the study was performed in such a way that it reflected the examination circumstances as closely as possible: the duration of the live assessment of a cavity preparation was determined in preliminary studies, and the assessment questionnaires were aligned with the checklists familiar from the examinations [1, 27]. To address the problem of generalisation arising from differing teaching experience, the assessment of the cavities was calibrated in the preceding train-the-teacher events. Despite this, the following limitation should be taken into consideration: it is conceivable that when assessing a model (Part B), the evaluation was generally stricter, as the preliminary grades from the first part were already known.
It is also possible that, over the course of the experimental parts, a practice effect took place that affected each individual assessor to a different degree. This could explain why, despite the preceding train-the-teacher events, the inter-rater reliability differed. The influence of gender, age and teaching experience of the subject group was not a main focus of this examination, although it could well be addressed in future studies.
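The inter-rater reliability figures reported here (and in the conclusions below) are Cronbach's alpha values, which follow the standard formula α = k/(k−1) · (1 − Σσ²ᵢ / σ²ₜ) with each rater treated as an “item”. A minimal sketch of how such values can be computed from a subject-by-rater score matrix (the function name and the sample grades are illustrative, not the study's actual data):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_subjects x n_raters) score matrix.

    Each rater is treated as an 'item':
    alpha = k/(k-1) * (1 - sum(per-rater variances) / variance of row totals)
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of raters
    rater_vars = scores.var(axis=0, ddof=1)     # variance of each rater's grades
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of per-subject totals
    return k / (k - 1) * (1.0 - rater_vars.sum() / total_var)

# Hypothetical grades (1-6 scale) from two raters for four cavities:
grades = [[1, 2], [2, 1], [3, 3], [4, 4]]
print(round(cronbach_alpha(grades), 3))  # → 0.889
```

Values of α ≥ 0.6 are the threshold cited for practical examinations, and α ≥ 0.8 for ‘high-stakes’ examinations; perfectly consistent raters yield α = 1.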

Conclusion

1. This examination shows an average reliability of α = 0.833 in the control group assessment mode (Part A), which exceeds the requirements for practical examinations (α ≥ 0.6) and also meets the general requirements for ‘high-stakes’ examinations of α ≥ 0.8. In Part B, a reliability of α = 0.797 was determined, which does not differ substantially from the control group.

2. The overall assessment did not differ significantly between the two examination groups (Parts A and B). For the ‘outer contact positioning’ parameter, however, significant differences between A and B were determined.

3. The ICC values, with a mean of 0.43 < r < 0.74 for the control group assessment mode (Part A), are higher than in the study group assessment mode (Part B) with 0.35 < r < 0.60. The ICC values of the ‘slide-in direction’ and ‘outer contact positioning’ criteria in the control group assessment mode (Part A) are minimal. The maximum reliability of the criteria ‘preparation edge’, ‘surface’, ‘width & depth’ and ‘outer contact positioning’ in the control group assessment mode (prepCheck video) is acceptable at α > 0.7.

4. The study participants’ assessment of the application concept of the digital-analytic software and of the study procedure was generally positive.
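The ICC values in point 3 can likewise be computed from the raw score matrices. The study does not specify which ICC form was used; as one common variant, a minimal sketch of the one-way random-effects ICC(1,1), which derives the coefficient from the between-subject and within-subject mean squares of a one-way ANOVA (the sample grades are illustrative):

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for an (n_subjects x n_raters) matrix.

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB and MSW are the
    between-subject and within-subject mean squares from a one-way ANOVA.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    subject_means = x.mean(axis=1)
    grand_mean = x.mean()
    msb = k * ((subject_means - grand_mean) ** 2).sum() / (n - 1)
    msw = ((x - subject_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical grades from two raters for four cavities:
print(round(icc_oneway([[1, 2], [2, 1], [3, 3], [4, 4]]), 3))  # → 0.846
```

As with alpha, identical ratings across raters give an ICC of 1, while pure rater noise drives the value towards (or below) zero.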

  1. Schmitt L, Möltner A, Rüttermann S, Gerhardt-Szep S (2016) Study on the Interrater Reliability of an OSPE (Objective Structured Practical Examination) – Subject to the Evaluation Mode in the Phantom Course of Operative Dentistry 33: 61. Link: https://tinyurl.com/y7wtonlj  
  2. Natkin E, Guild RE (1967) Evaluation of preclinical laboratory performance: a systematic study 31: 152-161.
  3. Jasinevicius TR, Landers M, Nelson S, Urbankova A (2004) An evaluation of two dental simulation systems: virtual reality versus contemporary non-computer-assisted 68: 1151-1162. Link: https://tinyurl.com/ybl4p2yr
  4. Kournetas N, Jaeger B, Axmann D, Groten M, Lachmann S, et al. (2004) Assessing the reliability of a digital preparation assistant system used in dental education 68: 1228-1234. Link: https://tinyurl.com/y9run9ts
  5. Cardoso JA, Barbosa C, Fernandes S, Silva CL, Pinho A (2006) Reducing subjectivity in the evaluation of pre-clinical dental preparations for fixed prosthodontics using the Kavo PrepAssistant 10: 149-156. Link: https://tinyurl.com/y7t2w4xh
  6. Kenneth AE (2004) E-learning—new technologies but slow progress 7: 115-117.
  7. Kikuchi H, Ikeda M, Araki K (2013) Evaluation of a virtual reality simulation system for porcelain fused to metal crown preparation at Tokyo Medical and Dental University. 77: 782-792. Link: https://tinyurl.com/y834vzrx
  8. Taylor CL, Grey NJ, Satterthwaite JD (2013) A comparison of grades awarded by peer assessment, faculty and a digital scanning device in a pre-clinical operative skills course 17: 16-21. Link: https://tinyurl.com/y9wwcco2
  9. Turnbull J, Gray J, MacFadyen (1998) Improving in-training evaluation programs. J Gen Intern Med 3: 317-323. Link: https://tinyurl.com/ya3tfd5g
  10. Gerhardt-Szep S, Güntsch A, Pospiech P, Söhnel A, Scheutzel P, et al. (2016) Assessment formats in dental medicine: An overview. GMS J Med Educ 33: Doc65. Link:  https://tinyurl.com/yd5lb8v2
  11. Zia A, Sharma Y, Bettadapura V (2016) Automated video-based assessment of surgical skills for training and evaluation in medical schools. Int J CARS 11: 1623-1636. Link: https://tinyurl.com/yddz5rfx
  12. Scaffidi MA, Grover SC, Carnahan H, Yu JJ, Yong E, et al. (2017) A prospective comparison of live and video-based assessments of colonoscopy performance. Link: https://tinyurl.com/yc5gjsoj
  13. Macluskey M, Durham J, Balmer C, Bell A, Cowpe J, et al. (2011) Dental student suturing skills: a multicentre trial of a checklist-based assessment. Eur J Dent Educ 15: 224-229. Link: https://tinyurl.com/y8ta8dg7
  14. Chen AC, Lee MS, Chen WJ, Lee ST (2013) Assessment in orthopedic training-an analysis of rating consistency by using an objective structured examination video. J Surg Educ 70: 189-192. Link: https://tinyurl.com/ybykh3rc
  15. Laeeq K, Infusino S, Lin SY, Reh DD, Ishii M, et al. (2010) Video-based assessment of operative competency in endoscopic sinus surgery. Am J Rhinol Allergy 24: 234-237. Link: https://tinyurl.com/yauyald6
  16. Podsakoff NP, Podsakoff PM, Mackenzie SB, Klinger RL (2013) Are we really measuring what we say we're measuring? Using video techniques to supplement traditional construct validation procedures. J Appl Psychol. Link: https://tinyurl.com/ybk8xv7e
  17. Sarkiss CA, Philemond S, Lee J, Sobotka S, Holloway TD, et al. (2016) Neurosurgical Skills Assessment: Measuring Technical Proficiency in Neurosurgery Residents Through Intraoperative Video Evaluations. World Neurosurg 89: 1-8. Link: https://tinyurl.com/y8znzezo
  18. Perron NJ, Louis-Simonet M, Cerutti B, Pfarrwaller E, Sommer J, et al. (2016) Feedback in formative OSCEs: comparison between direct observation and video-based formats. Med Educ Online 21: 32160. Link: https://tinyurl.com/yb7d9a45
  19. Takazawa S, Ishimaru T, Harada K (2015) Video-Based Skill Assessment of Endoscopic Suturing in a Pediatric Chest Model and a Box Trainer. J Laparoendosc Adv Surg Tech A 25: 445-453. Link: https://tinyurl.com/y6uwhuy7
  20. Massey D, Byrne J, Higgins N, Weeks B, Shuker MA, et al. (2017) Enhancing OSCE preparedness with video exemplars in undergraduate nursing students. A mixed method study. Nurse Educ Today: 54: 56-61. Link: https://tinyurl.com/y7v9jcgt
  21. Deie K, Ishimaru T, Takazawa S, Harada K, Sugita N, et al. (2017) Preliminary Study of Video-Based Pediatric Endoscopic Surgical Skill Assessment Using a Neonatal Esophageal Atresia/Tracheoesophageal Fistula Model. J Laparoendosc Adv Surg Tech A 27: 76-81. Link: https://tinyurl.com/y8skg2b6
  22. Simpson D, Helm R, Drewniak T, Ziebert MM, Brown D, et al. (2006) Objective Structured Video Examinations (OSVEs) for Geriatrics Education. Gerontol Geriatr Educ 26: 7-24. Link: https://tinyurl.com/ycul69kp
  23. Pérez-Escamirosa F, Chousleb-Kalach A, del Carmen Hernández-Baro M, Sánchez-Margallo JA, Lorias-Espinoza D, et al. (2016) Construct validity of a video-tracking system based on orthogonal cameras approach for objective assessment of laparoscopic skills. Int J CARS 11: 2283-2293. Link: https://tinyurl.com/yb3wz2vq
  24. Weigl P, Felber R, Brandt J, König E, Lauer HC (2015) Fully automated and objective quality inspection of a clinical tooth preparation. Proceedings 40th Annual Meeting of the Association for Dental Education in Europe (ADEE). Eur J Dent Educ 19: e8–e35.  
  25. Stumpf A, Weigl P, Gerhardt T, Felber R, Heidemann D, Gerhardt-Szép (2015) Computer-aided 3D analysis of cavities via prepCheck - a pilot study. Proceedings 40th Annual Meeting of the Association for Dental Education in Europe (ADEE). Eur J Dent Educ 19: e8–e35. Link: https://tinyurl.com/yabr9pub
  26. Roopa VA (2017) The Calibration of a Software Programme to Assess Ceramic Crown Preparations in a Pre-clinical Setting. Link: https://tinyurl.com/y9t7n2un
  27. Petkov P, Knuth-Herzig K, Hoefer S, Stehle S, Scherer S, et al. (2017) The reliability and predictive validity of a sixth-semester OSPE in conservative dentistry regarding performance on the state examination. GMS J Med Educ 34. Link: https://tinyurl.com/y9gu4jqj
  28. Kateeb ET, Kamal MS, Kadamani AM, Abu Hantash RO, Arqoub MM (2016) Utilising an innovative digital software to grade pre-clinical crown preparation exercise. Eur J Dent Educ. Link: https://tinyurl.com/ydz8oc3d
  29. Esser C, Kerschbaum T, Winkelmann V, Krage T, Faber FJ (2006) A comparison of the visual and technical assessment of preparations made by dental students. Eur J Dent Educ 10: 157-161. Link: https://tinyurl.com/y77ecmm7
  30. Urbankova A (2010) Impact of computerized dental simulation training on preclinical operative dentistry examination scores. J Dent Educ 74: 402-409. Link: https://tinyurl.com/y7gfqazv
  31. Sampaio-Fernandes MA, Sampaio-Fernandes MM, Fonseca PA, Almeida PR, Reis-Campos JC, et al. (2015) Evaluation of occlusal rest seats with 3D technology in dental education. J Dent Educ 79: 166-176. Link: https://tinyurl.com/ycvykecy
© 2018 Wälter A, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
 
