
Document: Digital Signal Processing Handbook P50 ppt



Shore, J. “Software Tools for Speech Research and Development” Digital Signal Processing Handbook Ed. Vijay K. Madisetti and Douglas B. Williams Boca Raton: CRC Press LLC, 1999

© 1999 by CRC Press LLC

50 Software Tools for Speech Research and Development

John Shore, Entropic Research Laboratory, Inc.

50.1 Introduction
50.2 Historical Highlights
50.3 The User’s Environment (OS-Based vs. Workspace-Based)
    Operating-System-Based Environment • Workspace-Based Environment
50.4 Compute-Oriented vs. Display-Oriented
    Compute-Oriented Software • Display-Oriented Software • Hybrid Compute/Display-Oriented Software
50.5 Compiled vs. Interpreted
    Interpreted Software • Compiled Software • Hybrid Interpreted/Compiled Software • Computation vs. Display
50.6 Specifying Operations Among Signals
    Text-Based Interfaces • Visual (“Point-and-Click”) Interfaces • Parametric Control of Operations
50.7 Extensibility (Closed vs. Open Systems)
50.8 Consistency Maintenance
50.9 Other Characteristics of Common Approaches
    Memory-based vs. File-based • Documentation of Processing History • Personalization • Real-Time Performance • Source Availability • Hardware Requirements • Cross-Platform Compatibility • Degree of Specialization • Support for Speech Input and Output
50.10 File Formats (Data Import/Export)
50.11 Speech Databases
50.12 Summary of Characteristics and Uses
50.13 Sources for Finding Out What is Currently Available
50.14 Future Trends
References

50.1 Introduction

Experts in every field of study depend on specialized tools. In the case of speech research and development, the dominant tools today are computer programs. In this article, we present an overview of key technical approaches and features that are prevalent today.

We restrict the discussion to software intended to support R&D, as opposed to software for commercial applications of speech processing. For example, we ignore DSP programming (which is discussed in the previous article).
Also, we concentrate on software intended to support the specialities of speech analysis, coding, synthesis, and recognition, since these are the main subjects of this chapter. However, much of what we have to say applies as well to the needs of those in such closely related areas as psycho-acoustics, clinical voice analysis, sound and vibration, etc.

We do not attempt to survey available software packages, as the result would likely be obsolete by the time this book is printed. The examples mentioned are illustrative, and not intended to provide a thorough or balanced review. Our aim is to provide sufficient background so that readers can assess their needs and understand the differences among available tools. Up-to-date surveys are readily available online (see Section 50.13).

In general, there are three common uses of speech R&D software:
• Teaching, e.g., homework assignments for a basic course in speech processing
• Interactive, free-form exploration, e.g., designing a filter and evaluating its effects on a speech processing system
• Batch experiments, e.g., training and testing speech coders or speech recognizers using a large database

The relative importance of various features differs among these uses. For example, in conducting batch experiments, it is important that large signals can be handled, and that complicated algorithms execute efficiently. For teaching, on the other hand, these features are less important than simplicity, quick experimentation, and ease-of-use. Because of practical limitations, such differences in priority mean that no one software package today can meet all needs.

To explain the variation among current approaches, we identify a number of distinguishing characteristics. These characteristics are not independent (i.e., there is considerable overlap), but they do help to present the overall view. For simplicity, we will refer to any particular speech R&D software as “the speech software”.

50.2 Historical Highlights

Early or significant examples of speech R&D software include “Visible Speech” [5], MITSYN [1], and Lloyd Rice’s WAVE program of the mid 1970s (not to be confused with David Talkin’s waves [8]). The first general, commercial system that achieved widespread acceptance was the Interactive Laboratory System (ILS) from Signal Technology Incorporated, which was popular in the late 1970s and early 1980s. Using the terminology defined below, ILS is compute-oriented software with an operating-system-based environment. The first popular, display-oriented, workspace-based speech software was David Shipman’s LISP-machine application called Spire [6].

50.3 The User’s Environment (OS-Based vs. Workspace-Based)

In some cases, the user sees the speech software as an extension of the computer’s operating system. We call this “operating-system-based” (or OS-based); an example is the Entropic Signal Processing System (ESPS) [7].

In other cases, the software provides its own operating environment. We call this “workspace-based” (from the term used in implementations of the programming language APL); an example is MATLAB™ (from The MathWorks).

50.3.1 Operating-System-Based Environment

In this approach, signals are represented as files under the native operating system (e.g., Unix, DOS), and the software consists of a set of programs that can be invoked separately to process or display signals in various ways. Thus, the user sees the software as an extension of an already-familiar operating system. Because signals are represented as files, the speech software inherits file manipulation capabilities from the operating system. Under Unix, for example, signals can be copied and moved respectively using the cp and mv programs, and they can be organized as directory trees in the Unix hierarchical file system (including NFS).

Similarly, the speech software inherits extension capabilities inherent in the operating system.
Under Unix, for example, extensions can be created using shell scripts in various languages (sh, csh, Tcl, perl, etc.), as well as such facilities as pipes and remote execution. OS-based speech software packages are often called command-line packages because usage typically involves providing a sequence of commands to some type of shell.

50.3.2 Workspace-Based Environment

In this approach, the user interacts with a single application program that takes over from the operating system. Signals, which may or may not correspond to files, are typically represented as variables in some kind of virtual space. Various commands are available to process or display the signals. Such a workspace is often analogous to a personal blackboard. Workspace-based systems usually offer means for saving the current workspace contents and for loading previously saved workspaces.

An extension mechanism is typically provided by a command interpreter for a simple language that includes the available operations and a means for encapsulating and invoking command sequences (e.g., in a function or procedure definition). In effect, the speech software provides its own shell to the user.

50.4 Compute-Oriented vs. Display-Oriented

This distinction concerns whether the speech software emphasizes computation or visualization or both.

50.4.1 Compute-Oriented Software

If there is a large number of signal processing operations relative to the number of signal display operations, we say that the software is compute-oriented. Such software typically can be operated without a display device and the user thinks of it primarily as a computation package that supports such functions as spectral analysis, filtering, linear prediction, quantization, analysis/synthesis, pattern classification, Hidden Markov Model (HMM) training, speech recognition, etc.

Compute-oriented software can be either OS-based or workspace-based. Examples include ESPS, MATLAB™, and the Hidden Markov Model Toolkit (HTK) (from Cambridge University and Entropic).

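To make "compute-oriented" concrete, the following is a minimal sketch of one of the functions listed above, linear prediction, written so that it needs no display device. It uses the autocorrelation method with the Levinson-Durbin recursion. This is an illustration only, not how ESPS, MATLAB, or HTK implement it, and the function names are assumptions introduced here.

```python
def autocorrelation(signal, max_lag):
    """Return autocorrelation values r[0..max_lag] of a finite signal."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations by Levinson-Durbin.

    The predictor is x_hat[n] = sum over k of a[k] * x[n - k], k = 1..order.
    Returns (coefficients, residual_energy); coefficients[0] holds a[1].
    """
    if r[0] <= 0.0:
        raise ValueError("signal has no energy")
    coeffs = []
    error = r[0]
    for i in range(1, order + 1):
        # Correlation not yet explained by the order-(i-1) predictor.
        acc = r[i] - sum(coeffs[j] * r[i - 1 - j] for j in range(i - 1))
        k = acc / error  # reflection coefficient for this order
        coeffs = [coeffs[j] - k * coeffs[i - 2 - j] for j in range(i - 1)] + [k]
        error *= 1.0 - k * k
    return coeffs, error


if __name__ == "__main__":
    # Illustrative test signal: a decaying exponential, x[n] = 0.9 ** n.
    signal = [0.9 ** n for n in range(200)]
    coeffs, error = levinson_durbin(autocorrelation(signal, 2), 2)
    print(coeffs, error)
```

Because the routine reads a list and returns numbers, it could be run from a batch script over a large database just as easily as from an interactive session, which is the property the text attributes to compute-oriented software.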
50.4.2 Display-Oriented Software

In contrast, display-oriented speech software is not intended to and often cannot operate without a display device. The primary purpose is to support visual inspection of waveforms, spectrograms, and other parametric representations. The user typically interacts with the software using a mouse or other pointing device to initiate display operations such as scrolling, zooming, enlarging, etc.

While the software may also provide computations that can be performed on displayed signals (or marked segments of displayed signals), the user thinks of the software as supporting visualization more than computation. An example is the waves program [8].

50.4.3 Hybrid Compute/Display-Oriented Software

Hybrid compute/display software combines the best of both. Interactions are typically by means of a display device, but computational capabilities are rich. The computational capabilities may be built into workspace-based speech software, or may be OS-based but accessible from the display program. Examples include the Computerized Speech Lab (CSL) from Kay Elemetrics Corp., and the combination of ESPS and waves.

50.5 Compiled vs. Interpreted

Here we distinguish according to whether the bulk of the signal processing or display code (whether written by developers or users) is interpreted or compiled.

50.5.1 Interpreted Software

The interpreter language may be specially designed for the software (e.g., S-PLUS from Statistical Sciences, Inc., and MATLAB™), or may be an existing, general purpose language (e.g., LISP is used in N!Power from Signal Technology, Inc.). Compared to compiler languages, interpreter languages tend to be simpler and easier to learn. Furthermore, it is usually easier and faster to write and test programs under an interpreter. The disadvantage, relative to compiled languages, is that the resulting programs can be quite slow to run.
As a result, interpreted speech software is usually better suited for teaching and interactive exploration than for batch experiments.

50.5.2 Compiled Software

Compared to interpreted languages, compiled languages (e.g., FORTRAN, C, C++) tend to be more complicated and harder to learn. Compared to interpreted programs, compiled programs are slower to write and test, but considerably faster to run. As a result, compiled speech software is usually better suited for batch experiments than for teaching.

50.5.3 Hybrid Interpreted/Compiled Software

Some interpreters make it possible to create new language commands with an underlying implementation that is compiled. This allows a hybrid approach that can combine the best of both.

Some languages provide a hybrid approach in which the source code is pre-compiled quickly into intermediate code that is then (usually!) interpreted. Java is a good example.

If compiled speech software is OS-based, signal processing scripts can typically be written in an interpretive language (e.g., a sh script containing a sequence of calls to ESPS programs). Thus, hybrid systems can also be based on compiled software.

50.5.4 Computation vs. Display

The distinction between compiled and interpreted languages is relevant mostly to the computational aspects of the speech software. However, the distinction can apply as well to display software, since some display programs are compiled (e.g., using Motif) while others exploit interpreters (e.g., Tcl/Tk, Java).

50.6 Specifying Operations Among Signals

Here we are concerned with the means by which users specify what operations are to be done and on what signals. This consideration is relevant to how speech software can be extended with user-defined operations (see Section 50.7), but is an issue even in software that is not extensible.

The main distinction is between a text-based interface and a visual (“point-and-click”) interface. Visual interfaces tend to be less general but easier to use.

50.6.1 Text-Based Interfaces

Traditional interfaces for specifying computations are based on a textual representation in the form of scripts and programs. For OS-based speech software, operations are typically specified by typing the name of a command (with possible options) directly to a shell. One can also enter a sequence of such commands into a text editor when preparing a script.

This style of specifying operations also is available for workspace-based speech software that is based on a command interpreter. In this case, the text comprises legal commands and programs in the interpreter language.

Both OS-based and workspace-based speech software may also permit the specification of operations using source code in a high-level language (e.g., C) that gets compiled.

50.6.2 Visual (“Point-and-Click”) Interfaces

The point-and-click approach has become the ubiquitous user-interface of the 1990s. Operations and operands (signals) are specified by using a mouse or other pointing device to interact with on-screen graphical user-interface (GUI) controls such as buttons and menus. The interface may also have a text-based component to allow the direct entry of parameter values or formulas relating signals.

Visual Interfaces for Display-Oriented Software

In display-oriented software, the signals on which operations are to be performed are visible as waveforms or other directly representative graphics. A typical user-interaction proceeds as follows: A relevant signal is specified by a mouse-click operation (if a signal segment is involved, it is selected by a click-and-drag operation or by a pair of mouse-click operations). The operation to be performed is then specified by mouse click operations on screen buttons, pull-down menus, or pop-up menus.

This style works very well for unary operations (e.g., compute and display the spectrogram of a given signal segment), and moderately well for binary operations (e.g., add two signals). But it is awkward for operations that have more than two inputs.
It is also awkward for specifying chained calculations, especially if you want to repeat the calculations for a new set of signals.

One solution to these problems is provided by a “calculator-style” interface that looks and acts like a familiar arithmetic calculator (except the operands are signal names and the operations are signal processing operations).

Another solution is the “spreadsheet-style” interface. The analogy with spreadsheets is tight. Imagine a spreadsheet in which the cells are replaced by images (waveforms, spectrograms, etc.) connected logically by formulas. For example, one cell might show a test signal, a second might show the results of filtering it, and a third might show a spectrogram of a portion of the filtered signal. This exemplifies a spreadsheet-style interface for speech software.

A spreadsheet-style interface provides some means for specifying the “formulas” that relate the various “cells”. This formula interface might itself be implemented in a point-and-click fashion, or it might permit direct entry of formulas in some interpretive language. Speech software with a spreadsheet-style interface will maintain consistency among the visible signals. Thus, if one of the signals is edited or replaced, the other signal graphics change correspondingly, according to the underlying formulas. DADisp (from DSP Development Corporation) is an example of a spreadsheet-style interface.

Visual Interfaces for Compute-Oriented Software

In a visual interface for display-oriented software, the focus is on the signals themselves. In a visual interface for compute-oriented software, on the other hand, the focus is on the operations. Operations among signals typically are represented as icons with one or more input and output lines that interconnect the operations. In effect, the representation of a signal is reduced to a straight line indicating its relationship (input or output) with respect to operations.
Such visual interfaces are often called block-diagram interfaces. In effect, a block-diagram interface provides a visual representation of the computation chain. Various point-and-click means are provided to support the user in creating, examining, and modifying block diagrams. Ptolemy [4] and N!Power are examples of systems that provide a block-diagram interface.

Limitations of Visual Interfaces

Although much in vogue, visual interfaces are inherently limited as a means for specifying signal computations.

For example, the analogy between spreadsheets and spreadsheet-style speech software continues. For simple signal computations, the spreadsheet-style interface can be very useful; computations are simple to set up and informative when operating. For complicated computations, however, the spreadsheet-style interface inherits all of the worst features of spreadsheet programming. It is difficult to encapsulate common sub-calculations, and it is difficult to organize the “program” so that the computational structure is self-evident. The result is that spreadsheet-style programs are hard to write, hard to read, and error-prone.

In this respect, block-diagram interfaces do a better job since their main focus is on the underlying computation rather than on the signals themselves. Thus, screen “real-estate” is devoted to the computation rather than to the signal graphics. However, as the complexity of computations grows, the geometric and visual approach eventually becomes unwieldy. When was the last time you used a flowchart to design or document a program?

It follows that visual interfaces for specifying computations tend to be best suited for teaching and interactive exploration.

50.6.3 Parametric Control of Operations

Speech processing operations often are based on complicated algorithms with numerous parameters. Consequently, the means for specifying parameters is an important issue for speech software.

The simplest form of parametric control is provided by command-line options on command-line programs.
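As a small sketch of command-line parametric control, the following uses Python's standard argparse module. The program name lpc_analyze, the option names, and the default values are hypothetical and chosen only for illustration; they do not correspond to any package discussed in this chapter.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical analysis command."""
    parser = argparse.ArgumentParser(
        prog="lpc_analyze",  # hypothetical program name
        description="Illustrative command-line parametric control.",
    )
    # Each processing parameter becomes an option with a default value.
    parser.add_argument("--order", type=int, default=12,
                        help="prediction order")
    parser.add_argument("--frame-ms", type=float, default=20.0,
                        help="frame length in milliseconds")
    parser.add_argument("--window", default="hamming",
                        choices=["hamming", "hanning", "rect"],
                        help="analysis window")
    parser.add_argument("input", help="input signal file")
    return parser


def parse_parameters(argv):
    """Parse argv (without the program name) into a plain dict."""
    return vars(build_parser().parse_args(argv))


if __name__ == "__main__":
    # Equivalent to typing: lpc_analyze --order 10 speech.sd
    print(parse_parameters(["--order", "10", "speech.sd"]))
```

Only the options given on the command line differ from their defaults, so a user who is content with the defaults types nothing beyond the file name.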
This is convenient, but can be cumbersome if there are many parameters. A common alternative is to read parameter values from parameter files that are prepared in advance. Typically, command-line values can be used to override values in the parameter file. A third input source for parameter values is directly from the user in response to prompts issued by the program.

Some systems offer the flexibility of a hierarchy of inputs for parameter values, for example:
• default values
• values from a global parameter file read by all programs
• values from a program-specific parameter file
• values from the command line
• values from the user in response to run-time prompts

In some situations, it is helpful if a current default value is replaced by the most recent input from a given parameter source. We refer to this property as “parameter persistence”.

50.7 Extensibility (Closed vs. Open Systems)

Speech software is “closed” if there is no provision for the user to extend it. There is a fixed set of operations available to process and display signals. What you get is all you get.

OS-based systems are always extensible to a degree because they inherit scripting capabilities from the OS, which permits the creation of new commands. They may also provide programming libraries so that the user can write and compile new programs and use them as commands.

Workspace-based systems may be extensible if they are based on an interpreter whose programming language includes the concept of an encapsulated procedure. If so, then users can write scripts that define new commands. Some systems also allow the interpreter to be extended with commands that are implemented by underlying code in C or some other compiled language.

In general, for speech software to be extensible, it must be possible to specify operations (see Section 50.6) and also to re-use the resulting specifications in other contexts.
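The idea of specifying an operation once and re-using it can be sketched with a toy workspace in which built-in and user-defined commands live in the same command table. The Workspace class and the command names below are assumptions made for illustration; they are not modeled on any particular package.

```python
class Workspace:
    """A toy workspace: named signals plus a table of named commands."""

    def __init__(self):
        self.signals = {}
        self.commands = {}

    def define(self, name, func):
        """Register a command; built-in and user commands share one table."""
        self.commands[name] = func

    def run(self, name, *signal_names, **params):
        """Invoke a command on signals that are looked up by name."""
        inputs = [self.signals[s] for s in signal_names]
        return self.commands[name](*inputs, **params)


def scale(signal, gain=1.0):
    """Built-in: multiply every sample by gain."""
    return [gain * x for x in signal]


def add(a, b):
    """Built-in: sample-by-sample sum of two signals."""
    return [x + y for x, y in zip(a, b)]


def make_workspace():
    """Create a workspace preloaded with the built-in commands."""
    ws = Workspace()
    ws.define("scale", scale)
    ws.define("add", add)
    return ws


def mix(a, b, gain=0.5):
    """User-defined command that encapsulates a sequence of built-ins."""
    return add(scale(a, gain), scale(b, gain))


if __name__ == "__main__":
    ws = make_workspace()
    ws.signals["a"] = [1.0, 2.0]
    ws.signals["b"] = [3.0, 4.0]
    ws.define("mix", mix)  # the extension step
    print(ws.run("mix", "a", "b"))
```

Once defined, mix is invoked through the same run call as scale and add, so it can be re-used in later command sequences without the caller knowing whether it was built in or added by the user.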
A block-diagram interface is extensible, for example, if a given diagram can be reduced to an icon that is available for use as a single block in another diagram.

For speech software with visual interfaces, extensibility considerations also include the ability to specify new GUI controls (visible menus and buttons), the ability to tie arbitrary internal and external computations to GUI controls, and the ability to define new display methods for new signal types.

In general, extended commands may behave differently from the built-in commands provided with the speech software. For example, built-in commands may share a common user interface that is difficult to implement in an independent script or program (such a common interface might provide standard parameters for debug control, standard processing of parameter files, etc.).

If user-defined scripts, programs, and GUI components are indistinguishable from built-in facilities, we say that the speech software provides seamless extensibility.

50.8 Consistency Maintenance

A speech processing chain involves signals, operations, and parameter sets. An important consideration for speech software is whether or not consistency is maintained among all of these. Thus, for example, if one input signal is replaced with another, are all intermediate and output signals recalculated automatically? Consistency maintenance is primarily an issue for speech software with visual interfaces, namely whether or not the software guarantees that all aspects of the visible displays are consistent with each other.

Spreadsheet-style interfaces (for display-oriented software) and block-diagram interfaces (for compute-oriented software) usually provide consistency maintenance.

50.9 Other Characteristics of Common Approaches

50.9.1 Memory-based vs. File-based

“Memory-based” speech software carries out all of its processing and display operations on signals that are stored entirely within memory, regardless of whether or not the signals also have an external representation as a disk file. This approach has obvious limitations with respect to signal size, but it simplifies programming and yields fast operation. Thus, memory-based software is well-suited for teaching and the interactive exploration of small samples.

In “file-based” speech software, on the other hand, signals are represented and manipulated as disk files. The software partially buffers portions of the signal in memory as required for processing and display operations. Although programming can be more complicated, the advantage is that there are no inherent limitations on signal size. The file-based approach is, therefore, well-suited for large scale experiments.

50.9.2 Documentation of Processing History

Modern speech processing involves complicated algorithms with many processing steps and operating parameters. As a result, it is often important to be able to reconstruct exactly how a given signal was produced. Speech software can help here by creating appropriate records as signal and parameter files are processed.

The most common method for recording this information about a given signal is to put it in the same file as the signal. Most modern speech software uses a file format that includes a “file header” that is used for this purpose. Most systems store at least some information in the header, e.g., the sampling rate of the signal. Others, such as ESPS, attempt to store all relevant information. In this approach, the header of a signal file produced by any program includes the program name, values of processing parameters, and the names and headers of all source files. The header is a recursive structure, so that the headers of the source files themselves contain the names and headers of files that were prior sources.
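The recursive header idea can be sketched as a small data structure. This is a hypothetical in-memory layout chosen for illustration; it is not the actual ESPS header format, and the program and file names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Header:
    """Hypothetical processing-history header (not the real ESPS layout)."""
    program: str                                  # program that wrote the file
    params: dict = field(default_factory=dict)    # processing parameters used
    sources: list = field(default_factory=list)   # (file_name, Header) pairs


def history(header, depth=0):
    """Return indented lines describing the whole processing chain."""
    lines = ["  " * depth + f"{header.program} {header.params}"]
    for file_name, source_header in header.sources:
        lines.append("  " * (depth + 1) + f"from {file_name}:")
        # Recurse: each source header carries its own sources in turn.
        lines.extend(history(source_header, depth + 2))
    return lines


if __name__ == "__main__":
    # Invented three-step chain: record -> filter -> spectrum.
    raw = Header("record", {"rate": 16000})
    filtered = Header("filter", {"cutoff": 4000}, [("raw.sd", raw)])
    spectrum = Header("spectrum", {"order": 12}, [("filt.sd", filtered)])
    print("\n".join(history(spectrum)))
```

Walking the structure from the final file recovers every program, parameter set, and source file that contributed to it, at the cost of headers that grow with the length of the chain.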
Thus, a signal file header contains the headers of all source files in the processing chain. It follows that files contain a complete history of the origin of the data in the file and all the intermediate processing steps. The importance of record keeping grows with the complexity of computation chains and the extent of available parametric control.

50.9.3 Personalization

There is considerable variation in the extent to which speech software can be customized to suit personal requirements and tastes. Some systems cannot be personalized at all; they start out the same way, every time. But most systems store personal preferences and use them again next time. Savable preferences may include color selections, button layout, button semantics, menu contents, currently loaded signals, visible windows, window arrangement, and default parameter sets for speech processing operations.

At the extreme, some systems can save a complete “snapshot” that permits exact resumption. This is particularly important for the interactive study of complicated signal configurations across repeated software sessions.

50.9.4 Real-Time Performance

Software is generally described as “real-time” if it is able to keep up with relevant, changing inputs. In the case of speech software, this usually means that the software can keep up with input speech. Even this definition is not particularly meaningful unless the input speech is itself coming from a human speaker and digitized in real-time. Otherwise, the real issue is whether or not the software is fast enough to keep up with interactive use.

For example, if one is testing speech recognition software by directly speaking into the computer, real-time performance is important. It is less important, on the other hand, if the test procedure involves running batch scripts on a database of speech files.

If the speech software is designed to take input directly from devices (or pipes, in the case of Unix), then the issue becomes one of CPU speed.

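One common way to quantify whether software "keeps up" is the ratio of processing time to signal duration, often called the real-time factor. That term is not used in this chapter; it is introduced here as an assumption, and the helper names are illustrative.

```python
import time


def real_time_factor(processing_seconds, num_samples, sample_rate):
    """Processing time divided by signal duration.

    A value below 1.0 means the processing finished faster than the
    signal lasts, i.e., it could keep up with live input.
    """
    duration_seconds = num_samples / sample_rate
    return processing_seconds / duration_seconds


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # One second of silence at 16 kHz, and a trivial stand-in operation.
    samples = [0.0] * 16000
    _, elapsed = timed(sum, samples)
    print(real_time_factor(elapsed, len(samples), 16000))
```

For batch experiments the same ratio is still informative, but as a throughput estimate rather than as a requirement.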
50.9.5 Source Availability

It is unfortunate but true that the best documentation for a given speech processing command is often the source code. Thus, the availability of source code may be an important factor for this reason alone. Typically, this is more important when the software is used in advanced R&D applications. Sources also are needed if users have requirements to port the speech software to additional platforms. Source availability may also be important for extensibility, since it may not be possible to extend the speech software without the sources.

If the speech software is interpreter-based, sources of interest will include the sources for any built-in operations that are implemented as interpreter scripts.

50.9.6 Hardware Requirements

Speech software may require the installation of special purpose hardware. There are two main reasons for such requirements: to accelerate particular computations (e.g., spectrograms), and to provide speech I/O with A/D and D/A converters.

Such hardware has several disadvantages. It adds to the system cost, and it decreases the overall reliability of the system. It may also constrain system software upgrades; for example, the extra hardware may use special device drivers that do not survive OS upgrades.

Special purpose hardware used to be common, but is less so now owing to the continuing increase in CPU speeds and the prevalence of built-in audio I/O. It is still important, however, when maximum speed and high-quality audio I/O are important. CSL is a good example of an integrated hardware/software approach.

50.9.7 Cross-Platform Compatibility

If your hardware platform may change or your site has a variety of platforms, then it is important to consider whether the speech software is available across a variety of platforms. Source availability (Section 50.9.5) is relevant here.

If you intend to run the speech software on several platforms that have different underlying numeric representations (a byte order difference being most likely), then it is important to know whether the file formats and signal I/O software support transparent data exchange.

50.9.8 Degree of Specialization

Some speech software is intended for general purpose work in speech (e.g., ESPS/waves, MATLAB™). Other software is intended for more specialized usage. Some of the areas where specialized software tools may be relevant include linguistics, recognition, synthesis, coding, psycho-acoustics, clinical-voice, music, multi-media, sound and vibration, etc. Two examples are HTK for recognition, and Delta (from Eloquent Technology) for synthesis.

[...]

... hardware may still be needed, including:
• need for more than two channels
• need for very high sampling rates
• compatibility with special hardware (e.g., DAT tape)

50.10 File Formats (Data Import/Export)

Signal file formats are fundamentally important because they determine how easy it is for independent programs to read and write the files (interoperability). Furthermore, the format determines whether files ...

... compression formats in a variety of standard and specialized formats. Sphere works well for sampled data files, but is limited for more general speech data files. A general purpose, public-domain format (Esignal) has recently been made available by Entropic.

50.11 Speech Databases

Numerous databases (or corpora) of speech are available from various sources. For a current list, see the comp.speech Frequently ...

References

[...] ... interactive dialogue language for time signal processing, MIT Research Laboratory for Electronics, report RLE TM-1, 1975.
[3] Kopec, G., The integrated signal processing system ISP, IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-32(4), 842-851, Aug. 1984.
[4] Pino, J.L., Ha, S., Lee, E.A. and Buck, J.T., Software synthesis for DSP using Ptolemy, J. VLSI Signal Processing, 9(1), 7-21, Jan. 1995.
[5] ... Visible Speech, D. Van Nostrand Company, New York, 1946.
[6] Shipman, D., SpireX: Statistical analysis in the SPIRE acoustic-phonetic workstation, Proc. ICASSP, Boston, 1983.
[7] Shore, J., Interactive signal processing with UNIX, Speech Technol., 3, March/April 1988.
[8] Talkin, D., Looking at speech, Speech Technol., 4, April/May 1989.

Date posted: 25/01/2014, 13:20
