Shore, J. “Software Tools for Speech Research and Development”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
© 1999 by CRC Press LLC
50
Software Tools for Speech Research and Development
John Shore
Entropic Research Laboratory, Inc.
50.1 Introduction
50.2 Historical Highlights
50.3 The User’s Environment (OS-Based vs. Workspace-Based)
Operating-System-Based Environment • Workspace-Based Environment
50.4 Compute-Oriented vs. Display-Oriented
Compute-Oriented Software • Display-Oriented Software • Hybrid Compute/Display-Oriented Software
50.5 Compiled vs. Interpreted
Interpreted Software • Compiled Software • Hybrid Interpreted/Compiled Software • Computation vs. Display
50.6 Specifying Operations Among Signals
Text-Based Interfaces • Visual (“Point-and-Click”) Interfaces • Parametric Control of Operations
50.7 Extensibility (Closed vs. Open Systems)
50.8 Consistency Maintenance
50.9 Other Characteristics of Common Approaches
Memory-Based vs. File-Based • Documentation of Processing History • Personalization • Real-Time Performance • Source Availability • Hardware Requirements • Cross-Platform Compatibility • Degree of Specialization • Support for Speech Input and Output
50.10 File Formats (Data Import/Export)
50.11 Speech Databases
50.12 Summary of Characteristics and Uses
50.13 Sources for Finding Out What is Currently Available
50.14 Future Trends
References
50.1 Introduction
Experts in every field of study depend on specialized tools. In the case of speech research and development, the dominant tools today are computer programs. In this article, we present an overview of key technical approaches and features that are prevalent today.
We restrict the discussion to software intended to support R&D, as opposed to software for commercial applications of speech processing. For example, we ignore DSP programming (which is discussed in the previous article). Also, we concentrate on software intended to support the specialities of speech analysis, coding, synthesis, and recognition, since these are the main subjects of this chapter. However, much of what we have to say applies as well to the needs of those in such closely related areas as psycho-acoustics, clinical voice analysis, sound and vibration, etc.
We do not attempt to survey available software packages, as the result would likely be obsolete by the time this book is printed. The examples mentioned are illustrative, and not intended to provide a thorough or balanced review. Our aim is to provide sufficient background so that readers can assess their needs and understand the differences among available tools. Up-to-date surveys are readily available online (see Section 50.13).
In general, there are three common uses of speech R&D software:
• Teaching, e.g., homework assignments for a basic course in speech processing
• Interactive, free-form exploration, e.g., designing a filter and evaluating its effects on a
speech processing system
• Batch experiments, e.g., training and testing speech coders or speech recognizers using a
large database
The relative importance of various features differs among these uses. For example, in conducting batch experiments, it is important that large signals can be handled, and that complicated algorithms execute efficiently. For teaching, on the other hand, these features are less important than simplicity, quick experimentation, and ease-of-use. Because of practical limitations, such differences in priority mean that no one software package today can meet all needs.
To explain the variation among current approaches, we identify a number of distinguishing characteristics. These characteristics are not independent (i.e., there is considerable overlap), but they do help to present the overall view.
For simplicity, we will refer to any particular speech R&D software as “the speech software”.
50.2 Historical Highlights
Early or significant examples of speech R&D software include “Visible Speech” [5], MITSYN [1], and Lloyd Rice’s WAVE program of the mid 1970s (not to be confused with David Talkin’s waves [8]).
The first general, commercial system that achieved widespread acceptance was the Interactive
Laboratory System (ILS) from Signal Technology Incorporated, which was popular in the late 1970s
and early 1980s. Using the terminology defined below, ILS is compute-oriented software with an
operating-system-based environment. The first popular, display-oriented, workspace-based speech
software was David Shipman’s LISP-machine application called Spire [6].
50.3 The User’s Environment (OS-Based vs. Workspace-Based)
In some cases, the user sees the speech software as an extension of the computer’s operating system. We call this “operating-system-based” (or OS-based); an example is the Entropic Signal Processing System (ESPS) [7].
In other cases, the software provides its own operating environment. We call this “workspace-based” (from the term used in implementations of the programming language APL); an example is MATLAB™ (from The MathWorks).
50.3.1 Operating-System-Based Environment
In this approach, signals are represented as files under the native operating system (e.g., Unix, DOS), and the software consists of a set of programs that can be invoked separately to process or display signals in various ways. Thus, the user sees the software as an extension of an already-familiar operating system. Because signals are represented as files, the speech software inherits file manipulation capabilities from the operating system. Under Unix, for example, signals can be copied and moved respectively using the cp and mv programs, and they can be organized as directory trees in the Unix hierarchical file system (including NFS).
Similarly, the speech software inherits extension capabilities inherent in the operating system. Under Unix, for example, extensions can be created using shell scripts in various languages (sh, csh, Tcl, perl, etc.), as well as such facilities as pipes and remote execution. OS-based speech software packages are often called command-line packages because usage typically involves providing a sequence of commands to some type of shell.
50.3.2 Workspace-Based Environment
In this approach, the user interacts with a single application program that takes over from the
operating system. Signals, which may or may not correspond to files, are typically represented as
variables in some kind of virtual space. Various commands are available to process or display the
signals. Such a workspace is often analogous to a personal blackboard.
Workspace-based systems usually offer means for saving the current workspace contents and for
loading previously saved workspaces.
An extension mechanism is typically provided by a command interpreter for a simple language that includes the available operations and a means for encapsulating and invoking command sequences (e.g., in a function or procedure definition). In effect, the speech software provides its own shell to the user.
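The workspace idea above can be sketched in a few lines. This is a minimal, illustrative model (the class and method names are invented, not taken from any actual package): signals live as named variables in a workspace, commands operate on them by name, and the whole workspace can be saved and reloaded.

```python
import pickle

class Workspace:
    """Toy workspace: named signals plus save/load of the whole state."""

    def __init__(self):
        self.signals = {}  # name -> list of samples

    def let(self, name, samples):
        self.signals[name] = list(samples)

    def apply(self, out, func, *names):
        # Encapsulated command: compute func over named signals.
        self.signals[out] = func(*(self.signals[n] for n in names))

    def save(self, path):
        with open(path, "wb") as f:
            pickle.dump(self.signals, f)

    def load(self, path):
        with open(path, "rb") as f:
            self.signals = pickle.load(f)

ws = Workspace()
ws.let("x", [1, 2, 3])
ws.apply("y", lambda x: [2 * v for v in x], "x")
print(ws.signals["y"])  # -> [2, 4, 6]
```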
50.4 Compute-Oriented vs. Display-Oriented
This distinction concerns whether the speech software emphasizes computation or visualization or both.
50.4.1 Compute-Oriented Software
If there is a large number of signal processing operations relative to the number of signal display operations, we say that the software is compute-oriented. Such software typically can be operated without a display device, and the user thinks of it primarily as a computation package that supports such functions as spectral analysis, filtering, linear prediction, quantization, analysis/synthesis, pattern classification, Hidden Markov Model (HMM) training, speech recognition, etc.
Compute-oriented software can be either OS-based or workspace-based. Examples include ESPS, MATLAB™, and the Hidden Markov Model Toolkit (HTK) (from Cambridge University and Entropic).
50.4.2 Display-Oriented Software
In contrast, display-oriented speech software is not intended to and often cannot operate without a display device. The primary purpose is to support visual inspection of waveforms, spectrograms, and other parametric representations. The user typically interacts with the software using a mouse or other pointing device to initiate display operations such as scrolling, zooming, enlarging, etc.
While the software may also provide computations that can be performed on displayed signals (or marked segments of displayed signals), the user thinks of the software as supporting visualization more than computation. An example is the waves program [8].
50.4.3 Hybrid Compute/Display-Oriented Software
Hybrid compute/display software combines the best of both. Interactions are typically by means of a display device, but computational capabilities are rich. The computational capabilities may be built into workspace-based speech software, or may be OS-based but accessible from the display program. Examples include the Computerized Speech Lab (CSL) from Kay Elemetrics Corp., and the combination of ESPS and waves.
50.5 Compiled vs. Interpreted
Here we distinguish according to whether the bulk of the signal processing or display code (whether written by developers or users) is interpreted or compiled.
50.5.1 Interpreted Software
The interpreter language may be specially designed for the software (e.g., S-PLUS from Statistical Sciences, Inc., and MATLAB™), or may be an existing, general purpose language (e.g., LISP is used in N!Power from Signal Technology, Inc.).
Compared to compiler languages, interpreter languages tend to be simpler and easier to learn. Furthermore, it is usually easier and faster to write and test programs under an interpreter. The disadvantage, relative to compiled languages, is that the resulting programs can be quite slow to run. As a result, interpreted speech software is usually better suited for teaching and interactive exploration than for batch experiments.
50.5.2 Compiled Software
Compared to interpreted languages, compiled languages (e.g., FORTRAN, C, C++) tend to be more complicated and harder to learn. Compared to interpreted programs, compiled programs are slower to write and test, but considerably faster to run. As a result, compiled speech software is usually better suited for batch experiments than for teaching.
50.5.3 Hybrid Interpreted/Compiled Software
Some interpreters make it possible to create new language commands with an underlying implementation that is compiled. This allows a hybrid approach that can combine the best of both.
Some languages provide a hybrid approach in which the source code is pre-compiled quickly into intermediate code that is then (usually!) interpreted. Java is a good example.
If compiled speech software is OS-based, signal processing scripts can typically be written in an interpretive language (e.g., a sh script containing a sequence of calls to ESPS programs). Thus, hybrid systems can also be based on compiled software.
50.5.4 Computation vs. Display
The distinction between compiled and interpreted languages is relevant mostly to the computational aspects of the speech software. However, the distinction can apply as well to display software, since some display programs are compiled (e.g., using Motif) while others exploit interpreters (e.g., Tcl/Tk, Java).
50.6 Specifying Operations Among Signals
Here we are concerned with the means by which users specify what operations are to be done and on what signals. This consideration is relevant to how speech software can be extended with user-defined operations (see Section 50.7), but is an issue even in software that is not extensible.
The main distinction is between a text-based interface and a visual (“point-and-click”) interface. Visual interfaces tend to be less general but easier to use.
50.6.1 Text-Based Interfaces
Traditional interfaces for specifying computations are based on a textual representation in the form of scripts and programs. For OS-based speech software, operations are typically specified by typing the name of a command (with possible options) directly to a shell. One can also enter a sequence of such commands into a text editor when preparing a script.
This style of specifying operations also is available for workspace-based speech software that is based on a command interpreter. In this case, the text comprises legal commands and programs in the interpreter language.
Both OS-based and workspace-based speech software may also permit the specification of operations using source code in a high-level language (e.g., C) that gets compiled.
50.6.2 Visual (“Point-and-Click”) Interfaces
The point-and-click approach has become the ubiquitous user interface of the 1990s. Operations and operands (signals) are specified by using a mouse or other pointing device to interact with on-screen graphical user-interface (GUI) controls such as buttons and menus. The interface may also have a text-based component to allow the direct entry of parameter values or formulas relating signals.
Visual Interfaces for Display-Oriented Software
In display-oriented software, the signals on which operations are to be performed are visible
as waveforms or other directly representative graphics.
A typical user-interaction proceeds as follows: A relevant signal is specified by a mouse-click operation (if a signal segment is involved, it is selected by a click-and-drag operation or by a pair of mouse-click operations). The operation to be performed is then specified by mouse-click operations on screen buttons, pull-down menus, or pop-up menus.
This style works very well for unary operations (e.g., compute and display the spectrogram of a
given signal segment), and moderately well for binary operations (e.g., add two signals). But it is
awkward for operations that have more than two inputs. It is also awkward for specifying chained
calculations, especially if you want to repeat the calculations for a new set of signals.
One solution to these problems is provided by a “calculator-style” interface that looks and acts like a familiar arithmetic calculator (except the operands are signal names and the operations are signal processing operations).
Another solution is the “spreadsheet-style” interface. The analogy with spreadsheets is tight. Imagine a spreadsheet in which the cells are replaced by images (waveforms, spectrograms, etc.) connected logically by formulas. For example, one cell might show a test signal, a second might show the results of filtering it, and a third might show a spectrogram of a portion of the filtered signal. This exemplifies a spreadsheet-style interface for speech software.
A spreadsheet-style interface provides some means for specifying the “formulas” that relate the various “cells”. This formula interface might itself be implemented in a point-and-click fashion, or it might permit direct entry of formulas in some interpretive language. Speech software with a spreadsheet-style interface will maintain consistency among the visible signals. Thus, if one of the signals is edited or replaced, the other signal graphics change correspondingly, according to the underlying formulas.
DADiSP (from DSP Development Corporation) is an example of a spreadsheet-style interface.
Visual Interfaces for Compute-Oriented Software
In a visual interface for display-oriented software, the focus is on the signals themselves. In a visual interface for compute-oriented software, on the other hand, the focus is on the operations. Operations among signals typically are represented as icons with one or more input and output lines that interconnect the operations. In effect, the representation of a signal is reduced to a straight line indicating its relationship (input or output) with respect to operations. Such visual interfaces are often called block-diagram interfaces. In effect, a block-diagram interface provides a visual representation of the computation chain. Various point-and-click means are provided to support the user in creating, examining, and modifying block diagrams.
Ptolemy [4] and N!Power are examples of systems that provide a block-diagram interface.
Limitations of Visual Interfaces
Although much in vogue, visual interfaces are inherently limited as a means for specifying signal computations.
For example, the analogy between spreadsheets and spreadsheet-style speech software continues. For simple signal computations, the spreadsheet-style interface can be very useful; computations are simple to set up and informative when operating. For complicated computations, however, the spreadsheet-style interface inherits all of the worst features of spreadsheet programming. It is difficult to encapsulate common sub-calculations, and it is difficult to organize the “program” so that the computational structure is self-evident. The result is that spreadsheet-style programs are hard to write, hard to read, and error-prone.
In this respect, block-diagram interfaces do a better job since their main focus is on the underlying computation rather than on the signals themselves. Thus, screen “real estate” is devoted to the computation rather than to the signal graphics. However, as the complexity of computations grows, the geometric and visual approach eventually becomes unwieldy. When was the last time you used a flowchart to design or document a program?
It follows that visual interfaces for specifying computations tend to be best suited for teaching and interactive exploration.
50.6.3 Parametric Control of Operations
Speech processing operations often are based on complicated algorithms with numerous parameters. Consequently, the means for specifying parameters is an important issue for speech software.
The simplest form of parametric control is provided by command-line options on command-line programs. This is convenient, but can be cumbersome if there are many parameters. A common alternative is to read parameter values from parameter files that are prepared in advance. Typically, command-line values can be used to override values in the parameter file. A third input source for parameter values is directly from the user in response to prompts issued by the program.
Some systems offer the flexibility of a hierarchy of inputs for parameter values, for example:
• default values
• values from a global parameter file read by all programs
• values from a program-specific parameter file
• values from the command line
• values from the user in response to run-time prompts
In some situations, it is helpful if a current default value is replaced by the most recent input from a given parameter source. We refer to this property as “parameter persistence”.
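The hierarchy of parameter sources described above is a straightforward precedence chain, and can be sketched in a few lines. The parameter names here are invented for illustration; in Python, `collections.ChainMap` implements exactly this left-to-right lookup order.

```python
from collections import ChainMap

# Each dict stands in for one parameter source from the list above.
defaults     = {"frame_ms": 10, "order": 12, "window": "hamming"}
global_file  = {"window": "hanning"}   # global parameter file
program_file = {"order": 16}           # program-specific parameter file
command_line = {"frame_ms": 5}         # command-line overrides

# ChainMap searches maps left to right, so the highest-priority
# source (the command line) is listed first.
params = ChainMap(command_line, program_file, global_file, defaults)
print(params["frame_ms"], params["order"], params["window"])  # 5 16 hanning
```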
50.7 Extensibility (Closed vs. Open Systems)
Speech software is “closed” if there is no provision for the user to extend it. There is a fixed set of operations available to process and display signals. What you get is all you get.
OS-based systems are always extensible to a degree because they inherit scripting capabilities from the OS, which permits the creation of new commands. They may also provide programming libraries so that the user can write and compile new programs and use them as commands.
Workspace-based systems may be extensible if they are based on an interpreter whose programming language includes the concept of an encapsulated procedure. If so, then users can write scripts that define new commands. Some systems also allow the interpreter to be extended with commands that are implemented by underlying code in C or some other compiled language.
In general, for speech software to be extensible, it must be possible to specify operations (see Section 50.6) and also to re-use the resulting specifications in other contexts. A block-diagram interface is extensible, for example, if a given diagram can be reduced to an icon that is available for use as a single block in another diagram.
For speech software with visual interfaces, extensibility considerations also include the ability to specify new GUI controls (visible menus and buttons), the ability to tie arbitrary internal and external computations to GUI controls, and the ability to define new display methods for new signal types.
In general, extended commands may behave differently from the built-in commands provided with the speech software. For example, built-in commands may share a common user interface that is difficult to implement in an independent script or program (such a common interface might provide standard parameters for debug control, standard processing of parameter files, etc.).
If user-defined scripts, programs, and GUI components are indistinguishable from built-in facilities, we say that the speech software provides seamless extensibility.
50.8 Consistency Maintenance
A speech processing chain involves signals, operations, and parameter sets. An important consideration for speech software is whether or not consistency is maintained among all of these. Thus, for example, if one input signal is replaced with another, are all intermediate and output signals recalculated automatically? Consistency maintenance is primarily an issue for speech software with visual interfaces, namely whether or not the software guarantees that all aspects of the visible displays are consistent with each other.
Spreadsheet-style interfaces (for display-oriented software) and block-diagram interfaces (for compute-oriented software) usually provide consistency maintenance.
50.9 Other Characteristics of Common Approaches
50.9.1 Memory-based vs. File-based
“Memory-based” speech software carries out all of its processing and display operations on signals that are stored entirely within memory, regardless of whether or not the signals also have an external representation as a disk file. This approach has obvious limitations with respect to signal size, but it simplifies programming and yields fast operation. Thus, memory-based software is well-suited for teaching and the interactive exploration of small samples.
In “file-based” speech software, on the other hand, signals are represented and manipulated as disk files. The software partially buffers portions of the signal in memory as required for processing and display operations. Although programming can be more complicated, the advantage is that there are no inherent limitations on signal size. The file-based approach is, therefore, well-suited for large scale experiments.
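The partial-buffering idea can be sketched as follows: rather than loading the whole signal, a file-based system seeks to the region it needs and reads only that window. The helper below is illustrative only and assumes raw 16-bit samples with no header:

```python
import array
import os
import tempfile

def read_segment(path, start, count, bytes_per_sample=2):
    """Read `count` samples starting at sample index `start`."""
    with open(path, "rb") as f:
        f.seek(start * bytes_per_sample)   # jump straight to the window
        buf = array.array("h")             # 'h' = 16-bit signed samples
        buf.frombytes(f.read(count * bytes_per_sample))
        return list(buf)

# Write a small test signal to disk, then read a window from the middle
# without ever holding the whole file in memory.
data = array.array("h", range(1000))
path = os.path.join(tempfile.mkdtemp(), "signal.raw")
with open(path, "wb") as f:
    f.write(data.tobytes())

print(read_segment(path, 500, 4))  # [500, 501, 502, 503]
```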
50.9.2 Documentation of Processing History
Modern speech processing involves complicated algorithms with many processing steps and operating parameters. As a result, it is often important to be able to reconstruct exactly how a given signal was produced. Speech software can help here by creating appropriate records as signal and parameter files are processed.
The most common method for recording this information about a given signal is to put it in the same file as the signal. Most modern speech software uses a file format that includes a “file header” that is used for this purpose. Most systems store at least some information in the header, e.g., the sampling rate of the signal. Others, such as ESPS, attempt to store all relevant information. In this approach, the header of a signal file produced by any program includes the program name, values of processing parameters, and the names and headers of all source files. The header is a recursive structure, so that the headers of the source files themselves contain the names and headers of files that were prior sources. Thus, a signal file header contains the headers of all source files in the processing chain. It follows that files contain a complete history of the origin of the data in the file and all the intermediate processing steps. The importance of record keeping grows with the complexity of computation chains and the extent of available parametric control.
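The recursive header just described can be sketched as nested records: each processing step embeds the program name, its parameters, and the full headers of its sources. The structure and field names below are invented for illustration, not the actual ESPS header layout:

```python
import json

def make_header(program, params, sources):
    """Build a header that recursively embeds all source headers."""
    return {
        "program": program,
        "params": params,
        "sources": [{"name": n, "header": h} for n, h in sources],
    }

# A three-step chain: record -> filter -> spectrogram.
raw  = make_header("record", {"sample_rate": 16000}, [])
filt = make_header("filter", {"cutoff_hz": 4000}, [("raw.sd", raw)])
spec = make_header("spectrogram", {"frame_ms": 10}, [("filt.sd", filt)])

# The outermost header carries the complete processing history:
print(json.dumps(spec, indent=2))
```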
50.9.3 Personalization
There is considerable variation in the extent to which speech software can be customized to suit personal requirements and tastes. Some systems cannot be personalized at all; they start out the same way, every time. But most systems store personal preferences and use them again next time. Savable preferences may include color selections, button layout, button semantics, menu contents, currently loaded signals, visible windows, window arrangement, and default parameter sets for speech processing operations.
At the extreme, some systems can save a complete “snapshot” that permits exact resumption. This is particularly important for the interactive study of complicated signal configurations across repeated software sessions.
50.9.4 Real-Time Performance
Software is generally described as “real-time” if it is able to keep up with relevant, changing inputs. In the case of speech software, this usually means that the software can keep up with input speech. Even this definition is not particularly meaningful unless the input speech is itself coming from a human speaker and digitized in real-time. Otherwise, the real issue is whether or not the software is fast enough to keep up with interactive use.
For example, if one is testing speech recognition software by directly speaking into the computer, real-time performance is important. It is less important, on the other hand, if the test procedure involves running batch scripts on a database of speech files.
If the speech software is designed to take input directly from devices (or pipes, in the case of Unix), then the issue becomes one of CPU speed.
50.9.5 Source Availability
It is unfortunate but true that the best documentation for a given speech processing command is often the source code. Thus, the availability of source code may be an important factor for this reason alone. Typically, this is more important when the software is used in advanced R&D applications. Sources also are needed if users have requirements to port the speech software to additional platforms. Source availability may also be important for extensibility, since it may not be possible to extend the speech software without the sources.
If the speech software is interpreter-based, sources of interest will include the sources for any built-in operations that are implemented as interpreter scripts.
50.9.6 Hardware Requirements
Speech software may require the installation of special-purpose hardware. There are two main reasons for such requirements: to accelerate particular computations (e.g., spectrograms), and to provide speech I/O with A/D and D/A converters.
Such hardware has several disadvantages. It adds to the system cost, and it decreases the overall reliability of the system. It may also constrain system software upgrades; for example, the extra hardware may use special device drivers that do not survive OS upgrades. Special-purpose hardware used to be common, but is less so now owing to the continuing increase in CPU speeds and the prevalence of built-in audio I/O. It is still important, however, when maximum speed and high-quality audio I/O are important. CSL is a good example of an integrated hardware/software approach.
50.9.7 Cross-Platform Compatibility
If your hardware platform may change or your site has a variety of platforms, then it is important to consider whether the speech software is available across a variety of platforms. Source availability (Section 50.9.5) is relevant here.
If you intend to run the speech software on several platforms that have different underlying numeric representations (a byte order difference being most likely), then it is important to know whether the file formats and signal I/O software support transparent data exchange.
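The byte-order issue can be made concrete with a small example: if a file format fixes (or records) the byte order of its samples, any host can read the data correctly by converting explicitly on I/O. Python's `struct` module makes the two orders easy to compare; 16-bit signed samples are assumed for illustration.

```python
import struct

samples = (100, -200, 300)

# '<' forces little-endian, '>' big-endian; 'h' is a 16-bit signed sample.
little = struct.pack("<3h", *samples)
big    = struct.pack(">3h", *samples)

# The raw bytes differ between the two orders...
assert little != big
# ...but unpacking with the matching declared order recovers identical data.
assert struct.unpack("<3h", little) == struct.unpack(">3h", big) == samples
print(struct.unpack(">3h", big))
```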
50.9.8 Degree of Specialization
Some speech software is intended for general purpose work in speech (e.g., ESPS/waves, MATLAB™). Other software is intended for more specialized usage. Some of the areas where specialized software tools may be relevant include linguistics, recognition, synthesis, coding, psycho-acoustics, clinical voice, music, multi-media, sound and vibration, etc. Two examples are HTK for recognition, and Delta (from Eloquent Technology) for synthesis.
50.9.9 Support for Speech Input and Output
In the past, built-in speech I/O hardware was uncommon in workstations and PCs, so speech software typically supported speech I/O by means of add-on hardware supplied with the software or available from other third parties. This provided the desired capability, albeit with the disadvantages mentioned earlier (see Section 50.9.6). Today most workstations and […]

50.10 File Formats (Data Import/Export)
[…] in addition to a waveform itself?) The best way to design speech file formats is hotly debated, but the clear trend has been towards “self-describing” file formats that include information about the names, data types, and layout of all data in the file. (For example, this permits programs to retrieve data by name.) There are many popular file formats, and various programs are available for converting among them (e.g., SOX). For speech sampled data, the most important file format is Sphere (from NIST), which is used in the speech databases available from the Linguistic Data Consortium (LDC). Sphere supports several data compression formats […] in a variety of standard and specialized formats. Sphere works well for sampled data files, but is limited for more general speech data files. A general purpose, public-domain format […] Entropic.

50.11 Speech Databases
Numerous databases (or corpora) of speech are available from various sources. For a current list, see the comp.speech Frequently Asked Questions (FAQ) (see Section 50.13). The largest supplier of speech data is the Linguistic Data Consortium, which publishes a large number of CDs containing speech and linguistic data.

50.12 Summary of Characteristics and Uses
In Section 50.1, we mentioned that the three most common uses for speech software are teaching, interactive exploration, and batch experiments. And at various points during the discussion of speech software characteristics, we mentioned their relative importance for the different classes of software uses. We attempt to summarize this in Table 50.1, where the symbol “•” indicates that a characteristic is […] Nevertheless, Table 50.1 is a reasonable starting point for evaluating particular software in the context of intended use.

TABLE 50.1 Relative Importance of Software Characteristics (columns: Teaching, Interactive exploration, Batch experiments; rows: OS-based (50.3.1), Workspace-based (50.3.2), Compute-oriented (50.4.1), Display-oriented (50.4.2), Compiled (50.5.2), Interpreted (50.5.1), Text-based interface (50.6.1), Visual interface (50.6.2), Parametric control (50.6.3), Extensibility (50.7), Consistency maintenance (50.8), Memory-based (50.9.1), File-based (50.9.1), History documentation (50.9.2), Personalization (50.9.3), Real-time performance (50.9.4), Source availability (50.9.5), Cross-platform compatibility (50.9.7), Support for speech I/O (50.9.9))

50.13 Sources for Finding Out What is Currently Available
The best single online source of general information is the Internet news group comp.speech, and in particular its FAQ (see http://svr-www.eng.cam.ac.uk/comp.speech/). Use this as a starting point. Here are some other WWW sites that (at this writing) contain speech software information or pointers to other sites:
http://svr-www.eng.cam.ac.uk
http://mambo.ucsc.edu/psl/speech.html
http://www.bdti.com/faq/dsp_faq.html

50.14 Future Trends
[…] will be used, for example, to show aspects of vector quantization or HMM clustering.) Public-domain file formats will dominate proprietary formats. Networked computers will be used for parallel computation if available. Tcl/Tk and Java will grow in popularity as a base for graphical data displays and user interfaces.

References
[1] Henke, W.L., Speech and audio computer-aided examination and analysis facility, MIT Research Laboratory for Electronics, 1969, 69–73.
[2] Henke, W.L., MITSYN — An interactive dialogue language for time signal processing, MIT Research Laboratory for Electronics, report RLE TM-1, 1975.
[3] Kopec, G., The integrated signal processing system ISP, IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-32(4), 842–851, Aug. 1984.
[4] Pino, J.L., Ha, S., Lee, E.A. and Buck, J.T., Software […]