Cite this as: Evitts PM, Starmer H, Webster K (2021) Effects of mode of presentation and mode of speech on listener perceptions of voice, speech and personality following supracricoid laryngectomy. Arch Otolaryngol Rhinol 7(1): 001-011. DOI: 10.17352/2455-1759.000138
Background: There is a paucity of information on listener perceptions of Individuals with a Laryngectomy (IWL) based on different modes of speech, in particular, speech following Supracricoid Laryngectomy (SCL). The purpose of this study was to determine whether listeners have different perceptions of an IWL based on type of surgery, mode of speech, and mode of presentation.
Methods: 35 naïve listeners (29 female, 6 male, mean age 31.1 years) were randomly presented with recordings of a standard reading passage produced by 15 different speakers (5 modes of speech x 3 speakers each mode) in both audio-only and audio-visual presentation mode. Listeners rated each speaker using a visual analog scale (10 cm line) on factors related to personality, comfort of speech, and voice quality.
Results: A multivariate Analysis of Variance (MANOVA) showed significant differences in mode of presentation (p<.001), mode of speech (p<.001), and a significant interaction effect between mode of presentation and mode of speech (p<.001).
Conclusions: Overall results suggest the following: IWL are perceived more favorably in the audio-visual mode; normal laryngeal speakers are perceived more favorably than all modes of alaryngeal speech and esophageal speech was perceived as the least favorable across most of the factors.
Laryngeal cancer is the most common form of malignancy of the head and neck. The American Cancer Society estimates that in 2020, 12,370 new cases of laryngeal cancer will be diagnosed and that there will be 3,750 laryngeal cancer related deaths in the United States. Traditionally, the Total Laryngectomy (TL) combined with a neck dissection was the treatment of choice for advanced laryngeal cancer. However, more recent research suggests that an alternative to TL that highlights conservation surgery or ‘functional laryngeal preservation’, the Supracricoid Laryngectomy (SCL), has also been shown to be an effective treatment [3-5]. Aside from being an effective treatment option, the SCL procedure results in dramatically different visual and acoustic changes relative to TL. This surgical treatment option, however, may not be as familiar to Speech-Language Pathologists (SLPs). The following introduction offers a review of the SCL procedure along with resultant voice, speech, and quality of life information, followed by a review of the impact of visual information on speech and voice disorders.
The relatively recent surgical development aimed at functional preservation of the larynx, the SCL, may be used as either an initial treatment or as a salvage option for advanced laryngeal cancers. The SCL was initially developed in Europe in the 1950s but was not found in the medical literature in the United States until the 1990s. Although the SCL has been used in other countries, the adoption of the SCL in the United States has been slow. The reasons for this slow adoption are debatable, although technical difficulties of the procedure and intense post-operative rehabilitation [5,8] have been suggested. Lai and Weinstein argued that the technique was not embraced by surgeons in the United States until numerous studies with large sample sizes were published reporting the oncologic and functional successes. Schindler and colleagues suggested that certain countries have not adopted the SCL due to the complexity of post-surgical management and increased variability in results. Oncologic outcomes following SCL have consistently shown the procedure to have excellent local control with low mortality rates [5,9]. The oncologic results are especially of note given that the SCL is completed without the need for a permanent tracheostoma, which is a required sequela following a TL. The presence of a stoma and the respiratory complications associated with it have been identified as significant concerns for patients following a TL [10,11].
The body of research on SCL is growing, specifically with regard to oncologic and swallowing outcomes, but currently there is minimal information on the impact of an SCL on a patient’s overall Quality of Life (QoL) and psychosocial status. Of the few existing studies, the consensus is that SCL results in improved psychosocial measures relative to TL [12,13]. Results of voice and speech related QoL measures following SCL are also limited. Makeieff and colleagues [14,15] used the Voice Handicap Index (VHI) to assess the impact of altered voice function following SCL. Results of both studies suggested that the resultant speech following SCL has a substantial impact on social and professional activities, especially for those patients who rely on their voice. Dworkin, et al. also used the VHI to compare voice handicap in SCL and TL patients. Results showed no significant difference between the two types of surgery and indicated that both SCL and TL patients experience moderate difficulties in communication due to their voice function. In contrast, Saito, et al. used the Voice-Related QoL (VRQoL) measure and interpreted their results to suggest that patients experience ‘little inconvenience’ in terms of speech after surgery. Weinstein, et al. found significantly higher scores for the SCL patients on the Physical Functioning and Total Score components of the VRQoL compared to the TL patients. Weinstein et al. attributed the increased VRQoL scores relative to the TL patients to the lack of a stoma, which makes the patients ‘feel fortunate’. Finally, Evitts and colleagues showed increased communicative competence in patients in Greece who had conservation surgery for laryngeal cancer. In summary, the limited research in this area presents mixed results.
Research has also addressed the resultant voice signal following SCL, and a brief summary of the results is presented here. Using a standardized perceptual rating system (GRBAS), most research describes the SCL voice as breathy and rough [4,17,19], or hoarse-strained. Results of acoustic analyses typically show the SCL voice to be substantially different from normal laryngeal voice. Although most studies identify the speech following SCL to be functional and intelligible [20,21], patients have reported their speech to be severely dysphonic.
To summarize, research exists on a variety of areas concerning the SCL procedure, from oncologic outcomes to voice-related QoL, among others. However, one important aspect that is missing from the literature is how listeners will perceive the resultant voice signal or even the person. Comparisons to other treatments for laryngeal cancer or other alaryngeal speech modes may also be an important factor to consider in those instances where persons with laryngeal cancer are presented with treatment options. To date, there is no information on listeners’ impressions of SCL speakers as compared to normal laryngeal speakers or to speakers with a TL who use a form of alaryngeal speech (tracheoesophageal, esophageal, electrolarynx). There are studies from the TL literature that suggest that social interaction has an effect on overall QoL. For example, Deshmane, et al. reported that 70% of TL patients suffer from decreased social acceptance and 82% suffer from reduced social activity. Nalbadian and colleagues reported that 57% of the TL patients experienced communication problems with unfamiliar people. Such results may be related to the physical appearance of the TL speaker (e.g., presence of a stoma, alteration of the vocal tract) rather than solely related to a different voice.
In an experiment that involved tracking a communication partner’s eye-gaze, Evitts and Gallop found different patterns of eye-gaze dependent on the type of alaryngeal speech used. For instance, during conversation with a speaker who used proficient esophageal or proficient electrolaryngeal speech, partners would direct their gaze predominantly at the lower face of the speaker. When conversing with a normal laryngeal speaker or a speaker who used proficient tracheoesophageal speech, 61% of the partner’s gaze would be focused on the lower face and 38% of the gaze would be divided among the background, lower face, and eyes. The authors attributed the difference in part to the inherent visual nature of esophageal and electrolaryngeal speech. That is, the extraneous facial movements associated with esophageal speech production (i.e., injection of air) and the addition of a mechanical device served to alter the eye gaze of the conversational partner and created a non-typical social interaction.
Specific to voice disorders, research has consistently demonstrated the importance of the visual component when obtaining listener perceptions. Numerous studies from a variety of fields have shown that visual information can have a negative effect on listener perceptions [24,25]. Moreover, these results suggest that speakers with a speech or voice disorder are perceived less favorably than normal speakers and that these impressions are impacted by the inclusion of visual information. Due to the inherent differences between normal laryngeal and alaryngeal speech [26,27], it may not be appropriate to extrapolate information from laryngeal speakers to alaryngeal speakers. In addition, IWL can present with visual information that differs significantly from that of other disordered populations, including decreased vocal tract volume, presence of a stoma, or the use of a prosthetic or mechanical speaking device.
Historically, studies in the field of alaryngeal speech and perception have recognized the importance of visual information in perceptual studies. One of the earliest references to this was in 1955 when Hyman called for the use of ‘motion picture films with sound’ to study the visual aspects of esophageal and electrolaryngeal speech. Numerous other researchers followed this suggestion by including, for example, real-time observations from gas station attendants speaking to a person who used electrolaryngeal speech. Overall results have consistently shown that listeners perceived the esophageal speakers more negatively than the normal speakers across all measures. Although these studies provide important insight into how listeners perceive IWL in terms of personality, voice, speech, and acceptability, among others, there are currently no studies that compare TL to newer surgical treatment methods for laryngeal cancer.
The primary purpose of this study was to provide insight into differences in listener impressions based on mode of presentation (audio-only vs. audio-visual) and mode of speech (normal laryngeal, tracheoesophageal, esophageal, electrolaryngeal, SCL). Specific research questions are as follows:
1. Is there a difference in listener impressions based on mode of presentation (audio-only, audiovisual)?
2. Is there a difference in listener impressions based on mode of speech (supracricoid, tracheoesophageal, esophageal, electrolaryngeal, normal laryngeal)?
Clinically, the information yielded from this study may provide important insight for people diagnosed with laryngeal cancer on how they may be perceived with their new form of voice. In addition, results of the study may have an impact on the type of surgery the surgeon recommends to the person with laryngeal cancer. For instance, if results show that supracricoid laryngectomy results in improved listener impressions and the person is a candidate for organ preservation surgery (see  for requirements of the surgeon and Turfano, 2002, for key principles of organ preservation surgery), then these results should be taken into account.
The study was approved by the Institutional Review Boards of both Towson University and Johns Hopkins School of Medicine. All participants provided written consent agreeing to the use of their images and voices for the purposes of the research study.
Five modes of speech (normal laryngeal, tracheoesophageal, esophageal, electrolaryngeal, SCL) were included in the study. The methods and criteria for speaker selection were similar to recent studies [31,32]. Briefly, three speakers from each mode of speech were selected from a collection of recordings. Inclusion criteria for all speakers included: standard Midwest dialect, English as their primary language, and fluent and effortless speech production. Exclusion criteria included presence of facial hair, significant facial asymmetry, facial scars other than those associated with the laryngectomy, and a history of stroke or other neurological disorder that affects speech or cognition. The speakers were then informally assessed by two licensed and certified SLPs with at least 10 years of clinical voice experience. The SLPs assessed whether or not each speaker was ‘typical’ for that mode of speech and informally rated their speech intelligibility using a Likert-style rating scale (poor, average, above average) following the presentation of a reading passage produced by each speaker. The final group of speakers used in the experiment was drawn from those speakers who had intelligibility ratings of ‘average’ or ‘above average’ and were rated as ‘typical’. Efforts were then taken to age- and gender-match all of the speakers in the final set, including the normal laryngeal speakers. Mean ages of the final pool of male speakers for each mode were: normal laryngeal = 66 years, tracheoesophageal = 62 years, esophageal = 66 years, electrolaryngeal = 68 years, and SCL = 52 years.
Speakers were recorded in a quiet room while seated and with a bare wall behind them. Following consent and a description of the study, a headset microphone (AKG, C 420 III) was placed on the head of each subject and the microphone itself was placed two inches from the corner of the mouth. The microphone was directly connected to a video recorder (Sony, DCR-HC30) and recorded on digital videotapes (Panasonic, DVM 60). Speakers were provided a copy of the grandfather passage to review prior to being recorded. A tripod was used to elevate the video recorder on a table top which was placed in front of the speakers. Efforts were made to keep the speaker and video recorder in the same horizontal plane. The grandfather passage was displayed on an 8 ½” x 11” sheet of white paper with a 1” hole cut out of the center of the page. This was done so that the speaker would appear to maintain eye-contact with the video recorder while reading. Speakers were asked to sit relatively still other than movements needed for voice production (e.g., digital occlusion). Each speaker was recorded with his entire head and neck in the frame and a small portion of the bare background.
Individual audiovisual files were created for each speaker using a video editing software program (Final Cut Pro X, Apple Inc.). Any noise or visual movements (e.g., stomal noise at the beginning of a sentence) associated with speech production for any modes of speech were included in the final file. The final edited version of each speaker’s audiovisual file was then imported into an acoustic editing software program (Adobe Audition 3.0, Adobe) which extracted the audio from the video file. Individual audio-only files were then created for each speaker. All files were stored for playback using Windows Media Player (Microsoft Corp).
Thirty-five naïve listeners (29 female, 6 male, mean age 31.1 years) served as participants for the study. Participants were recruited from a variety of undergraduate general education courses in different departments at a mid-Atlantic University. Inclusion criteria for the participants included English as their primary language, minimal to no exposure with alaryngeal speech or laryngeal cancer, sufficient visual acuity to accurately view a computer monitor at a distance of 2-3 feet, and no history of learning disability, speech or language disorder, or hearing disorder. Hearing status was assessed on the day of participation via audiometric screening (20 dB HL @ 0.5, 1, 2, 4 kHz). Participants were individually and randomly presented 30 recordings of the grandfather passage (15 audio-only, 15 audiovisual).
Participants were individually seated in a quiet room in front of a computer and a 22” LCD computer monitor (Acer AL2216W). Seating was arranged so that there would be approximately two feet between the LCD monitor and the participant. Audio was provided through a pair of noise-cancelling headphones (Sony MDR-NC60) connected to the computer. Each participant was instructed that they would be presented with a series of people reading a standard reading passage and that some files would be audio-only and some would be both audio and visual. Following each passage, the participants were instructed to rate that speaker using the given rating sheet. Participants were then provided the rating sheet and a brief explanation of how to use a visual analog scale. Once each participant stated they understood the procedure and the scale, each participant was then presented all 30 files (15 speakers x 2 modes of presentation) in a randomized order.
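The presentation schedule described above (15 speakers crossed with 2 modes of presentation, shuffled for each listener) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the speaker labels, the function name `build_playlist`, and the use of one random seed per listener are hypothetical.

```python
import random

# Five modes of speech with three speakers each, as described in the study.
MODES_OF_SPEECH = ["normal_laryngeal", "tracheoesophageal", "esophageal",
                   "electrolaryngeal", "SCL"]
SPEAKERS_PER_MODE = 3
PRESENTATIONS = ["audio_only", "audiovisual"]


def build_playlist(listener_seed):
    """Return all 30 (speaker, presentation) pairs in a random order."""
    # Hypothetical speaker labels such as "esophageal_2".
    speakers = [f"{mode}_{i}"
                for mode in MODES_OF_SPEECH
                for i in range(1, SPEAKERS_PER_MODE + 1)]
    # Cross every speaker with both modes of presentation.
    playlist = [(s, p) for s in speakers for p in PRESENTATIONS]
    # Assumption: each listener gets an independent, reproducible shuffle.
    rng = random.Random(listener_seed)
    rng.shuffle(playlist)
    return playlist


playlist = build_playlist(listener_seed=1)
print(len(playlist))  # 30
```

Seeding per listener is a design choice that makes each order reproducible for later auditing; the source states only that the order was randomized.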
The visual analog rating scale was based on earlier versions [31,34] and uses positive and negative anchors at each end of a 10 cm line. Users are requested to make a mark on the line nearest the term that they feel best describes the person. Using digital calipers, the distance is then measured in millimeters (mm) from the left, positive anchor to the mark. Lower values indicate more positive impressions and higher values indicate more negative or less favorable impressions. The first section focused on the personality of the speaker and contained eight descriptors (e.g., outgoing-sincere, coordinated-clumsy). The second section focused on how they would feel interacting with the person (e.g., I would be very likely-very unlikely to speak to this person at a social function). The third section focused on the person’s speech characteristics (e.g., flowing-choppy, soothing-irritating). Finally, participants were asked to write down responses to two open-ended questions regarding the speaker. The first question was ‘What words would you use to describe the person’s speech?’ The second question was ‘Was there anything that distracted your attention when you were listening to or watching the person?’
Figures 1-3 show the combined (audio-only and audio-visual) mean perceptual ratings by mode of speech across the three categories on the listener rating sheet. Lower values (0-50 mm) on the visual analog scale represent more favorable listener impressions and higher values (50-100 mm) represent less favorable. Values near 50 mm are considered neutral. Informal analysis of listener ratings of voice quality (Figure 1) shows that normal laryngeal is more favorable than all modes of alaryngeal speech and that tracheoesophageal and SCL were more favorable than esophageal or electrolaryngeal. Informal analysis of personality ratings (Figure 2) shows that normal laryngeal is rated more favorably than all modes of alaryngeal speech and that SCL is more favorable than the other modes of alaryngeal speech. Informal analysis of listener ratings of comfort of speech (Figure 3) shows that normal laryngeal is rated as more favorable than all modes of alaryngeal speech and that tracheoesophageal and SCL are found to be more favorable than esophageal and electrolaryngeal. Overall, informal analysis among the 24 ratings shows that normal laryngeal speech is perceived by listeners to be more favorable than all other modes of speech. In addition, SCL and tracheoesophageal speech appear to be more favorable than both esophageal and electrolaryngeal. Finally, across most ratings, listeners rated esophageal to be the least favorable across all modes of speech.
To determine if there were main and interaction effects for the independent variables of mode of presentation and mode of speech, a Multivariate Analysis of Variance (MANOVA) was calculated using a statistical software package (IBM SPSS, version 19). In order to control for Type 1 errors, alpha values were adjusted to p < .01 (.05 / 5 variables). The first research question focused on the effect of mode of presentation (audio-only vs. audiovisual). Results of the MANOVA showed a significant main effect for mode of presentation, Pillai’s Trace = .100, F (24, 874) =4.05, p <.001, partial η2=.100. The observed power of the test was high (1.00 at an alpha level of .05). The second research question focused on differences based on mode of speech. Results of the MANOVA showed a significant main effect for mode of speech, Pillai’s Trace = 1.475, F (96, 3508) =21.349, p <.001, partial η2=.369. The observed power of the test was high (1.00 at an alpha level of .05). Results of the MANOVA for the interaction of mode of speech and mode of presentation also showed a significant effect, Pillai’s Trace = .025, F (12, 6355) =4.998, p <.001, partial η2=.008.
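The alpha adjustment reported above is a Bonferroni-style division of the family-wise alpha by the number of comparisons. A minimal sketch of that arithmetic follows; the function names are hypothetical and the code is not taken from the authors' SPSS workflow.

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-test alpha after a Bonferroni-style adjustment."""
    return family_alpha / n_comparisons


def is_significant(p_value, family_alpha=0.05, n_comparisons=5):
    """Compare a p value against the adjusted threshold."""
    return p_value < bonferroni_alpha(family_alpha, n_comparisons)


adjusted = bonferroni_alpha(0.05, 5)
print(round(adjusted, 3))      # 0.01
print(is_significant(0.004))   # True
print(is_significant(0.021))   # False
```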
Due to the large number of variables, it was statistically prudent to determine if interdependencies existed among the variables. For this, a correlation matrix was computed for all of the variables within each mode of presentation. Results of the correlation matrix showed that all of the 224 correlations were statistically significant (p < .05). R values for the audio-only mode ranged from .305 to .889 and r values for the audiovisual mode ranged from .144 to .898. In such a case where multicollinearity exists, it is recommended to reduce the number of variables by identifying clusters of variables called factors. Using the statistical software package (IBM SPSS), a factor reduction was performed and those factors with eigenvalues greater than 1.0 were selected. Table 1 shows the factors and percent of variances for both audio-only and audiovisual. Once the factors are identified, the examiner then needs to identify the theme within those factor ratings. Ratings within each factor for both modes of presentation were consistent with the listener rating sheet. Thus, the ratings within Factor 1 were related to the speakers’ personality (factor loadings .728-.904), the ratings within Factor 2 were related to comfort of speech (factor loadings .648-.849), and the ratings within Factor 3 were related to voice quality (factor loadings .578-.831). Factor loadings greater than .400 are considered to be strong. The three factors accounted for 78.7% of the variance within the audio-only mode and 79.16% of the variance within the audio-visual mode.
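The retention rule described above (keep factors whose eigenvalues exceed 1.0, then report the percent of variance they explain) can be illustrated with a short sketch. The eigenvalues below are hypothetical values for a six-variable correlation matrix, chosen only so that they sum to 6; they are not the study's results, and the function names are assumptions. For a correlation matrix, each eigenvalue divided by the number of variables gives that factor's share of total variance.

```python
def retain_factors(eigenvalues, threshold=1.0):
    """Keep eigenvalues strictly greater than the threshold."""
    return [ev for ev in eigenvalues if ev > threshold]


def percent_variance(eigenvalue, n_variables):
    """Share of total variance for one factor of a correlation matrix."""
    return eigenvalue / n_variables * 100


# HYPOTHETICAL eigenvalues for illustration only (they sum to 6 variables).
eigenvalues = [3.1, 1.4, 1.05, 0.25, 0.12, 0.08]
n_variables = len(eigenvalues)

kept = retain_factors(eigenvalues)
cumulative = sum(percent_variance(ev, n_variables) for ev in kept)

print(len(kept))             # 3
print(f"{cumulative:.1f}%")  # 92.5%
```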
Once the interdependence among the variables was accounted for and three factors were identified, an additional MANOVA was performed. Table 2 shows the mean and standard deviation values by mode of speech, mode of presentation, and by factor. Research question 1 addressed differences in listener impressions based on mode of presentation. Results of the MANOVA showed a significant main effect for mode of presentation, Pillai’s Trace= .049, F(3, 2402) = 41.43, p <.001, partial η2=.049. The observed power of the test was high (1.00 at an alpha level of .05). Bonferroni post-hoc analysis showed that across all modes of speech, factor 1 was rated significantly higher (less favorable) in the audio-only condition, F(1,2404)=92.76, p < .001, partial η2=.037. Mean ratings for audio-only and audiovisual were 48.30 mm and 41.68 mm, respectively, representing a 14% difference.
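The 14% figure can be reproduced from the two reported means if the difference is expressed relative to the audio-only mean. That choice of baseline is an assumption, since the text does not state which mean served as the denominator, and the function name is hypothetical.

```python
def percent_difference(baseline_mm, comparison_mm):
    """Difference between two mean ratings as a percentage of the baseline."""
    return (baseline_mm - comparison_mm) / baseline_mm * 100


audio_only_mm = 48.30    # reported mean, audio-only
audiovisual_mm = 41.68   # reported mean, audiovisual

diff = percent_difference(audio_only_mm, audiovisual_mm)
print(f"{diff:.1f}%")  # 13.7%
```

Using the audiovisual mean as the baseline instead would give a somewhat larger value (about 16%), so the reported 14% is most consistent with the audio-only baseline.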
The second research question focused on the effects of mode of speech on listener impressions. Results of the MANOVA showed a significant main effect for mode of speech, Pillai’s Trace = .589, F(12, 7212) =146.78, p <.001, partial η2=.196. The observed power of the test was high (1.00 at an alpha level of .05). Bonferroni post-hoc testing (Figure 4) for factor 1 showed that: normal laryngeal was more favorable than all modes of alaryngeal speech (p < .001) and that SCL was more favorable than esophageal or electrolaryngeal (p < .001). Post-hoc testing for factor 2 showed that normal laryngeal was more favorable than all modes of alaryngeal speech (p < .001); esophageal was less favorable than tracheoesophageal and SCL (p < .001); and electrolaryngeal was less favorable than tracheoesophageal (p = .004). Finally, post-hoc testing for factor 3 showed that normal laryngeal was more favorable than all modes of alaryngeal speech (p < .001); esophageal was perceived as less favorable than all other modes of speech (p <.001); and electrolaryngeal was perceived as more favorable than SCL (p = .01).
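The reported effect sizes are consistent with the usual relationship between Pillai's Trace V and multivariate partial η2, namely V / s, where s is the smaller of the number of dependent variables and the hypothesis degrees of freedom. The sketch below checks that relationship and the standard F approximation against the reported values. It assumes s = 3 for mode of speech (three factors, four hypothesis degrees of freedom); the function names are hypothetical, and this is a consistency check rather than a reproduction of the SPSS output.

```python
def pillai_partial_eta_squared(pillai_trace, s):
    """Multivariate partial eta squared derived from Pillai's Trace."""
    return pillai_trace / s


def pillai_f_approx(pillai_trace, s, df1, df2):
    """Standard F approximation for Pillai's Trace."""
    return (pillai_trace / (s - pillai_trace)) * (df2 / df1)


# Mode of speech: Pillai's Trace = .589, F(12, 7212); assumed s = min(3, 4) = 3.
eta_sq = pillai_partial_eta_squared(0.589, s=3)
f_value = pillai_f_approx(0.589, s=3, df1=12, df2=7212)

print(round(eta_sq, 3))    # 0.196 (reported: .196)
print(round(f_value, 1))   # 146.8 (reported: 146.78)
```

The small gap between the recomputed and reported F values is what would be expected from rounding Pillai's Trace to three decimals.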
Testing for the interaction between mode of speech and mode of presentation, results of the MANOVA showed a significant interaction effect, Pillai’s Trace = .025, F(12, 7212) =4.998, p <.001, partial η2=.008. The observed power of the test was high (1.00 at an alpha level of .05). Univariate ANOVA and Bonferroni post hoc tests were conducted as follow-up tests. Results showed significant differences for factors 1-3 in the audiovisual mode and no significant differences in the audio-only mode. Bonferroni post hoc analysis of audiovisual factor 1, F(4, 1583)=30.30, p < .001, and audiovisual factor 2, F(4, 1186)=50.67, p < .001, showed that normal laryngeal speech was more favorable than all modes of alaryngeal speech (p values were all < .001). There were no differences among the alaryngeal modes in either factor. Post hoc analysis of factor 3, F(4, 1978)=495.18, p < .001, showed that normal laryngeal was more favorable than all modes of alaryngeal speech (p < .001) and that the voice quality of esophageal was perceived as less favorable than all other modes of alaryngeal speech (range of p values <.001 - .021).
Similar to previous research in the fields of psychology [37,38], stuttering [39,40], and alaryngeal speech, open-ended questions were analyzed using a theme-based approach. That is, responses to the two open-ended questions were reviewed by one examiner who tracked the number of responses within the categories that emerged. Only those adjectives that had a frequency count of more than four were included in the final descriptive analysis. Please note that the sum of the total responses does not equal the total number of listeners as many questions were left blank. Results of the theme-based analysis are shown in Table 3. The responses to question 1 (What words would you use to describe this person’s speech?) with the highest frequency count by mode of speech include: Normal laryngeal = normal sounding (n = 20 in both audio-only and audiovisual); Tracheoesophageal = gurgly (n = 15 audio-only and n = 12 audiovisual); Esophageal = choppy (n = 13 audio-only and n = 14 audiovisual); Electrolaryngeal = mechanical sounding (n = 34 in both audio-only and audiovisual); and SCL = rough (n = 13 audio-only) and raspy (n = 13 audiovisual). The responses to question 2 (Was there anything that distracted your attention when you were listening to or watching the person?) with the highest frequency count by mode of speech included: Normal laryngeal = no comments; Tracheoesophageal = wet (n = 5 audio-only) and touching throat (n = 30 audiovisual); Esophageal = extra noises (n = 9 audio-only) and movements of the mouth or face (n = 27); Electrolaryngeal = robotic (n = 7 audio-only) and the device (n = 20 audiovisual); and SCL = rough (n = 4 audio-only).
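The frequency-count step of the theme-based analysis (tally the descriptors, skip blank responses, keep only those counted more than four times) could be sketched as below. The responses are hypothetical placeholders, not study data, and the normalization (trimming and lower-casing) and the function name are assumptions; in the study a single examiner grouped the responses by hand.

```python
from collections import Counter


def frequent_descriptors(responses, min_count=5):
    """Tally non-blank descriptors and keep those seen at least min_count times.

    min_count=5 corresponds to 'a frequency count of more than four'.
    """
    cleaned = (r.strip().lower() for r in responses)
    counts = Counter(r for r in cleaned if r)  # blank answers are ignored
    return {word: n for word, n in counts.items() if n >= min_count}


# HYPOTHETICAL responses for illustration only.
responses = ["choppy"] * 6 + ["rough"] * 3 + ["", " Choppy "]
print(frequent_descriptors(responses))  # {'choppy': 7}
```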
The purpose of this study was to expand upon previous research on listener impressions of the speech of patients who underwent surgical treatment for laryngeal cancer. The study sought to include SCL speech in particular as there is limited information on how others perceive the speech, voice, and the person following SCL surgery. Specific research questions focused on mode of presentation and mode of speech. Overall, results suggest that listeners have more favorable impressions of all speakers in the audiovisual mode and that listeners perceive SCL speakers as comparable to tracheoesophageal speakers and, in some instances, more favorable than other modes of alaryngeal speech. Results also suggest a significant interaction effect between audiovisual information and mode of speech. Post hoc tests showed that this interaction effect was evident in the audiovisual mode but not the audio-only mode, suggesting that the visual information associated with some of the forms of alaryngeal speech may play a significant role in listeners’ impressions. Specific results are discussed below.
Results of this study suggest that when listeners are provided with audiovisual information, the result is more favorable impressions of the personality of a speaker but not necessarily how comfortable they feel listening to a speaker or the speakers’ voice quality. Although statistically significant, the difference only represented a 14% difference in ratings on the visual analog scale and both mean values (48 mm for the audio-only and 41 mm for the audiovisual mode) were close to ratings associated with neutral impressions. It should also be noted that only 5% of the variability associated with this factor could be attributed to the audiovisual condition. Regardless, the results suggest and support previous research [40,42] that visual information has an impact on how someone is perceived. In fact, the 14% difference in personality rating is nearly identical to the difference reported in an earlier study using similar methods (Evitts, et al. 2009). That study, however, only included one speaker from each mode of speech while this study increased that amount to three speakers per mode and added an additional mode of alaryngeal speech (SCL). The mean listener ratings in the audiovisual mode for personality were also relatively consistent between the two studies: 34 mm previously compared to 41 mm in the current study, suggesting increased validity for the current study.
The current results have implications for both health care professionals working in this field as well as future research. Clinically, results may highlight the need for health care professionals to educate patients and families on the importance of the visual effects of surgical treatment for laryngeal cancer. Obviously, there are inherent visual differences among different modes of speech and different surgical procedures (e.g., use of a mechanical speaking device, presence of a stoma). But those inherent differences are also intricately linked to how other people may perceive them, how intelligible their speech will be, and their overall quality of life. That there are differences in how they may be perceived by others based on mode of speech is an important piece of education to provide patients and their families. Future research could investigate if there are specific visual features within each mode of speech that are associated with favorable or less favorable listener perceptions. This research could also investigate if specific features are associated with changes in speech intelligibility across modes. Nearly 60 years have passed since Melvin Hyman (1955) argued the need to include visual information in this field of research and there is still much work to be done.
Overall, results of the current study showed a significant effect for mode of speech which accounted for nearly 20% of the variability among differences shown. The general trend across individual ratings and factors showed that normal laryngeal speech was perceived as more favorable than all modes of alaryngeal speech. Within factor 1, the personality of the normal laryngeal speakers was more favorable than all alaryngeal modes and the personality of the SCL speakers was perceived as more favorable than the esophageal and electrolaryngeal speakers. Results for factor 2 showed that normal laryngeal speakers were perceived as more favorable than all modes of alaryngeal speech and that tracheoesophageal and SCL speakers were more favorable than the esophageal speakers. Results for factor 3 were also similar in that listeners perceived the voice quality of normal laryngeal speech as more favorable than all modes of alaryngeal speech and the voice quality of esophageal speech was less favorable than all other modes of alaryngeal speech. That the normal laryngeal speakers were rated as more favorable than the alaryngeal speakers for all three factors is consistent with the bulk of the literature (e.g., Evitts, et al. 2009). Of note, however, was that the personality of the SCL speakers was perceived as more favorable than the esophageal and the electrolaryngeal speakers. Although the actual difference in mean listener ratings was quite small (~10%), the results may be hopeful for those who undergo the SCL surgery.
Across all three factors, listeners had similar perceptions of the SCL and tracheoesophageal speakers. As discussed earlier, the general hierarchy in the literature is that normal laryngeal speech is more favorable than tracheoesophageal speech, which is more favorable than esophageal speech, which is more favorable than electrolaryngeal speech. Including a relatively newer form of conservation surgery, specifically SCL, suggests that SCL speech approximates tracheoesophageal speech in that hierarchy. However, SCL speech may actually be closer to normal laryngeal speech in that it is produced with pulmonary airflow (as is tracheoesophageal speech) but also uses laryngeal tissue for the vibratory source, whereas tracheoesophageal speech uses upper esophageal sphincter and cricopharyngeal muscle fibers for the newly created pharyngoesophageal segment. This distinction may have importance when it comes to the brain processing the signal. Specifically, the brain has been shown to discriminate between human vocalizations and non-human vocalizations, and it appears that voice or speech that most closely approximates human vocal fold vibration requires less cognitive workload from listeners. Although not borne out in this study, it may be that the brain distinguishes or favors SCL speech relative to tracheoesophageal speech when various outcome measures are used. Regardless, considering that SCL results in improved QoL over all other modes following TL, the argument could be made that SCL speech is superior to tracheoesophageal speech. More favorable listener impressions of SCL speech may be a part of that increased QoL observed in persons treated with SCL.
Aside from favorable perceptions of SCL speech, listener perceptions of esophageal speech were also of note. The current results are consistent with previous studies which have shown esophageal speech to be perceived as less favorable and less intelligible than other modes of alaryngeal speech and to require more cognitive workload than other modes of alaryngeal speech. Esophageal speech has also been shown to be associated with different eye-gaze patterns from healthy control conversational partners during face-to-face interaction compared to other modes of alaryngeal speech. Although there are advantages to esophageal speech (i.e., hands-free mode, no tracheoesophageal fistula required), recent results suggest that other modes, including electrolaryngeal speech, may be a better option.
One of the inherent difficulties with studying disordered speech, regardless of the nature, is an increased heterogeneity of the resultant signal, even within mode of speech (see [26,45] for variability in acoustic measures among modes of speech). This study attempted to account for that by including three speakers from each mode and by attempting to use ‘typical’ speakers within each mode. Although the experienced clinicians who served as judges in this study were able to identify those ‘typical’ speakers, there is still a great deal of ambiguity as to what exactly that entails. This study utilized speakers with similar intelligibility, who were age- and gender-matched, and had relatively similar visual appearances (e.g., no facial hair). Rate of speech was also addressed across speakers, and a one-way ANOVA of words per minute for the grandfather passage by mode showed that esophageal speech was significantly slower than normal and SCL speech. This is consistent with previous research and may provide additional support that the speakers used in this study fall within that category of ‘typical’.
One particular mode of speech that may present with increased inherent differences is SCL. As discussed earlier, this surgery yields one of two types, CHEP or CHP. Due to the decreased amount of tissue resected, CHEP has been shown to be more favorable. In this study, two of the subjects were CHEP and one was CHP. Comparisons across speakers by type of SCL showed significant differences in all three factors in both audio-only and audiovisual mode, with the trend being that the speakers with a CHEP were perceived as more favorable than the speaker with a CHP. However, there were also significant differences present between the two CHEP speakers. These findings are consistent with previous research indicating that CHEP may be more favorable than CHP and may add to the validity of the current study. In addition, the differences present between the two CHEP speakers support the notion of increased heterogeneity in most disordered speaker populations.
Aside from there being significant effects of mode of presentation and mode of speech, there was also a significant interaction effect between the two variables. Subsequent analyses showed that this effect was present only in the audiovisual condition and not in the audio-only condition. Overall results in the audiovisual condition for all three factors showed that normal laryngeal speech was more favorable than all modes of alaryngeal speech. Additionally, the voice quality of esophageal speech was found to be significantly less favorable than all other modes but only in the audiovisual condition. This finding highlights the importance of visual information when discussing listener perceptions. That is, there are specific visual components inherent to each mode of speech that influence how listeners perceive a speaker. The qualitative comments (Table 3) provide insight on this interaction. Comments from question 1 (i.e., words used to describe the speaker) were consistent with expectations. For example, numerous comments identified electrolaryngeal speech as mechanical sounding and esophageal speech as choppy, which is consistent with the reduced airflow and subsequent reduced rate of speech observed with esophageal speech.
Responses to question 2 (i.e., anything distracting) provide much more insight into what characteristics listeners found salient. Some of those inherent traits for each mode are visual in nature and the current results suggest that these inherent visual traits may impact listener perceptions. For instance, 30 listeners commented about the tracheoesophageal speakers touching their throat¹, 27 listeners commented about the facial movements for the esophageal speakers, and 32 listeners commented about being distracted by the device itself or the person touching their throat or hand movements. There were no responses in the audiovisual mode for the normal laryngeal or the SCL speakers. These results in combination with other results of the current study suggest that SCL speech may not only be closer to normal laryngeal speech in terms of speech production, but also in terms of the visual component of speech production. Those inherent visual traits associated with other speech modes may not only impact listener perceptions of personality or voice quality, but may also impact speech perception overall. For example, head movements of normal speakers have been associated with the speakers’ fundamental frequency and amplitude of the speech signal whereas altered head movements were shown to result in decreased speech perception (Munhall, et al. 2004). Additionally, recent neuroimaging data showed that when listeners are presented with degraded auditory stimuli, they increase their attention to the visual information.
¹ It should be noted that hands-free tracheoesophageal prosthetics are available which would not require the speaker to occlude their stoma for voice production. However, for a variety of reasons only a small percentage of tracheoesophageal speakers use a hands-free device.
Although SCL speech may be considered a degraded auditory stimulus compared to normal laryngeal speech, the visual information most closely approximates that of normal laryngeal speech and thus, from a perceptual standpoint, the listener treats it in a similar fashion. Moreover, it may be that those inherent visual characteristics of speech production, from the esophageal or electrolaryngeal speakers in particular, are directly related to the current findings. That is, the more distracting the visual information, the more it impacts the listener. This incongruence between visual and auditory signal has been implicated in the reduced speech intelligibility observed in IWL, thus creating a McGurk effect of sorts. Clearly, more research is needed to delineate the role of visual information in speech processing for speakers with an SCL.
Although significant results are reported here and additional insight into listener perceptions is provided, the low eta squared values for each of the variables suggest other factors are influencing listener perceptions. In a previous study on speech intelligibility, Evitts, et al. reported that approximately 80% of the variability associated with speech intelligibility was accounted for by mode of speech. However, when the interaction between mode of speech and mode of presentation was considered, values of 6% to 23% were reported. Individual speaker differences may have played a role in this study as it included three speakers from within each mode. Although this increases the ability to generalize, it may alternatively decrease the variability accounted for. This was originally argued by Kalb and Carpenter, who stated that individual speaker characteristics played a larger role in speech intelligibility than mode of speech. That same influence of individual speaker characteristics may be true for listener perceptions as well. More research is needed to shed light on this issue.
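For readers less familiar with this effect-size statistic, eta squared can be read as the proportion of total variability in ratings that is attributable to a factor. The following is a minimal sketch for a simple one-way design only; it is not the MANOVA computation used in this study. The function name `eta_squared`, the mode labels, and all ratings are hypothetical and are not data from this study.

```python
from statistics import mean


def eta_squared(groups):
    """Return SS_between / SS_total for a one-way design.

    `groups` is a list of lists, one list of ratings per level of the
    factor (for example, one list per mode of speech).
    """
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    # Total variability of every rating around the grand mean.
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    # Variability of the group means around the grand mean,
    # weighted by the number of ratings in each group.
    ss_between = sum(
        len(group) * (mean(group) - grand_mean) ** 2 for group in groups
    )
    return ss_between / ss_total


# Hypothetical ratings in mm on a 100 mm visual analog scale.
# These values are illustrative only and are NOT taken from this study.
ratings_by_mode = {
    "mode_a": [20, 30, 40],
    "mode_b": [30, 40, 50],
    "mode_c": [40, 50, 60],
}

effect = eta_squared(list(ratings_by_mode.values()))
print(f"eta squared = {effect:.2f}")
```

In this toy example half of the total variability lies between the group means. A low value, as reported for the variables here, would mean that most of the variability lies within groups, which is in line with the suggestion that individual speaker differences contribute to listener perceptions.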
There are several limitations to this pilot study that make it difficult to generalize to other speakers with a laryngectomy. First, the disordered speakers were selected based on experienced SLPs’ rating as ‘typical’ and having average or above average intelligibility. As clinicians and health care professionals working with this population know so well, there is a great deal of heterogeneity in this population with regard to voice function following any form of treatment for laryngeal cancer. Although three speakers from each mode were included in this study and were considered by experienced SLPs to be ‘typical’ for their mode, additional research is warranted on those that may not represent ‘typical’. Moreover, additional research is warranted on those with decreased intelligibility in an attempt to better understand the relationship between intelligibility and listener impressions. Ideally, future research would consist of a large sample size with numerous speakers in each mode of speech representing varying degrees of intelligibility and voice quality. Second, the sentence stimuli that were used were initially intended to balance phonemic information but not visual information. Future visual processing research should control for this and other possible effects, including semantic and syntactic predictability [49-55]. Third, the listeners used in the current study may not represent the peer group of the population. Listeners in this study were predominantly young females and future research should seek to include persons that would better represent the peer groups of the patient population. This would include the use of spouses as potential listeners. Finally, only males with a laryngectomy were included in this study. Since more women are being diagnosed and treated for laryngeal cancer, similar studies should include the effects on females.
The purpose of this experiment was to investigate the effect of mode of speech and mode of presentation on listeners’ perceptions of speech following surgical treatment for laryngeal cancer. In particular, this study sought to include a relatively new form of conservation surgery, supracricoid laryngectomy, as this form of treatment has been associated with improved QoL compared to TL. Although there is research on a variety of outcomes following SCL, there is a lack of research on how listeners perceive this mode. Mean listener perceptions across ratings suggest that all modes of speech were perceived as either favorable or neutral for items related to personality (30-59 mm on a 100 mm visual analog scale) or comfort of speech (14-48 mm). Mean listener ratings for voice quality showed that normal was perceived as favorable but all modes of alaryngeal speech were perceived as less than favorable (61-76 mm). Overall results of the current study suggest that normal laryngeal speech is perceived as more favorable than all modes of alaryngeal speech across ratings of personality, comfort of speech, and voice quality, and that SCL speech was found to be at least equal to tracheoesophageal speech in all three areas as well. Additionally, esophageal speech was consistently perceived as the least favorable across all ratings, and listener qualitative comments suggest that the extraneous facial movements of the esophageal speakers may be associated with this finding. Furthermore, the personality of SCL speakers was perceived as the most favorable among all the modes of alaryngeal speech, and the voice quality and comfort of speech of the SCL speakers were found to be more favorable than those of the esophageal or electrolaryngeal speakers. Supracricoid laryngectomy has been associated with improved QoL, which may be primarily due to the lack of a permanent stoma as is the case following TL. However, this improved QoL may also be a function of more favorable listener perceptions.
When treatment options are available for laryngeal cancer, this study supports the increased utilization of the SCL surgery.
The authors would like to thank all of the participants in this study, in particular the people with laryngeal cancer who donated their time and efforts in hopes of providing further insight into people’s reactions following their treatment. The authors would also like to thank Christen Montgomery for her editorial assistance throughout the stages of this manuscript. Portions of this research were previously presented at the American Speech-Language-Hearing Association (ASHA) Annual Convention (Chicago, IL, 2010).