author     Daniel Povey  2017-08-21 18:06:11 -0500
committer  Daniel Povey  2017-08-21 18:06:11 -0500
commit     92dad5b6542bdbe0ef876b73cebbd16c391c56bf (patch)
tree       a5bd34103a7a96e6ad810c1491af723b63febd80
parent     91d47c1dc4cee5b28a2c13b1c63bbce8ecc8a54d (diff)
[doc] Add some terms to glossary
-rw-r--r--  src/doc/glossary.dox | 30
-rw-r--r--  src/doc/graph.dox    | 84
2 files changed, 72 insertions(+), 42 deletions(-)
diff --git a/src/doc/glossary.dox b/src/doc/glossary.dox
index 8bb8283e5..e1b3c01f7 100644
--- a/src/doc/glossary.dox
+++ b/src/doc/glossary.dox
@@ -50,11 +50,41 @@ contain alignment information as sequences of transition-ids for each word
 sequence in the lattice. The program \ref bin/show-alignments.cc "show-alignments" shows
 alignments in a human-readable format.
 
+
+<b>:cost:</b> Any quantity which is used as a 'cost' in a weighted FST
+   algorithm (e.g. acoustic cost, language model cost; see \ref lattices
+   for more details). Costs are, generally speaking, interpretable
+   as the negative log of a likelihood or probability, but there may
+   be scaling factors involved.
+
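As a hedged illustration of the "scaling factors" mentioned in this new entry (the symbols below are generic, not the names of any particular Kaldi options): the cost on a decoding-graph or lattice arc is typically a scaled negated log-probability, e.g.

\f[ \mathrm{cost} \;=\; -\,\alpha \,\log p_{\mathrm{acoustic}}(\mathbf{x} \mid s) \;-\; \log P_{\mathrm{graph}}(s), \f]

where \f$ \alpha \f$ plays the role of an acoustic scale and the graph term folds in the language-model and pronunciation contributions.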
 <b>:forced alignment:</b> see <b>alignment</b>.
 
 <b>:lattice:</b> A representation of alternative likely transcriptions of an utterance, together
   with associated alignment and cost information. See \ref lattices.
 
+
+<b>:likelihood:</b> A mathematical concept meaning the value of a function representing
+   the distribution of a continuous value. These values can be greater than one. Often represented
+   in log space (as log-likelihood) because likelihood values of multi-dimensional
+   features can often be too small or too large to fit in standard floating-point precision.
+   With standard cross-entropy-trained neural net systems we obtain "pseudo-likelihoods"
+   by dividing the output probabilities by the priors of the context-dependent states
+   (equivalently, subtracting the log-priors from the log-probabilities).
+
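The pseudo-likelihood trick described in this entry can be written compactly. Writing \f$ j \f$ for a context-dependent state and \f$ \mathbf{x}_t \f$ for the features at frame \f$ t \f$ (notation chosen here just for illustration), the quantity used in place of a log-likelihood is

\f[ \log \tilde{p}(\mathbf{x}_t \mid j) \;=\; \log P(j \mid \mathbf{x}_t) \;-\; \log P(j), \f]

i.e. the network's log-posterior minus the log-prior of the state; this equals the true log-likelihood up to a term that does not depend on \f$ j \f$ and therefore does not affect decoding.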
+<b>:posterior:</b> "Posterior" is shorthand for "posterior probability", which is a very
+   general mathematical concept, generally meaning "the probability of some random
+   variable after seeing the relevant data". In general, posteriors will sum to one.
+   In Kaldi terminology, if you encounter the term "posterior", abbreviated to "post",
+   without further explanation it generally means the per-frame posterior probability of
+   transition-ids. However, these posteriors may be very peaky (i.e. mostly ones and zeros)
+   depending on how you obtained them, e.g. from a lattice or from an alignment.
+   Alignments and lattices can be converted to posteriors over transition-ids (see
+   \ref ali-to-post.cc and \ref lattice-to-post.cc), and lattices to posteriors over
+   lattice arcs (see \ref lattice-arc-post.cc).
+   Posteriors over transition-ids can be converted to posteriors over pdf-ids or over phones;
+   see the tools \ref post-to-pdf-post.cc and \ref post-to-phone-post.cc.
+
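To make "per-frame posterior probability of transition-ids" and "sum to one" concrete, a schematic definition (our notation, not taken from the Kaldi sources) is

\f[ \gamma_t(i) \;=\; P\big(\text{transition-id at frame } t = i \,\big|\, \mathbf{x}_1,\dots,\mathbf{x}_T\big), \qquad \sum_i \gamma_t(i) = 1 \quad \text{for each frame } t. \f]

Posteriors derived from a single alignment are the degenerate ("peaky") case in which \f$ \gamma_t(i) \f$ is 1 for the aligned transition-id and 0 for all others.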
+<b>:pdf-id:</b> The zero-based integer index of a clustered context-dependent HMM state; see
+   \ref transition_model_identifiers for more information.
+
 <b>:transition-id:</b> a one-based index that encodes the pdf-id (i.e. the clustered context-dependent HMM state),
 the phone identity, and information about whether we took the self-loop or forward transition in the HMM.
 Appears in lattices, decoding graphs and alignments. See \ref transition_model.
diff --git a/src/doc/graph.dox b/src/doc/graph.dox
index 849abd3e7..cf0f99acb 100644
--- a/src/doc/graph.dox
+++ b/src/doc/graph.dox
@@ -24,35 +24,35 @@ namespace fst {
 \page graph Decoding graph construction in Kaldi
 
 Firstly, we cannot hope to introduce finite state transducers and how they
 are used in speech recognition. For that, see
 <a href=http://www.cs.nyu.edu/~mohri/pub/hbka.pdf> "Speech Recognition
 with Weighted Finite-State Transducers" </a> by Mohri, Pereira and
 Riley (in Springer Handbook on Speech Processing and Speech Communication, 2008).
 The general approach we use is as described there, but some of the details,
 particularly with respect to how
 we handle disambiguation symbols and how we deal with weight-pushing, differ.
 
 \section graph_overview Overview of graph creation
 
 The overall picture for decoding-graph creation is that we
 are constructing the graph HCLG = H o C o L o G. Here
 
  - G is an acceptor (i.e. its input and output symbols are the same)
    that encodes the grammar or language model.
  - L is the lexicon; its output symbols are words and its input
    symbols are phones.
  - C represents the context-dependency: its output symbols are phones
    and its input symbols represent context-dependent phones, i.e. windows
    of N phones; see \ref tree_window.
  - H contains the HMM definitions; its output symbols represent context-dependent
-   phones and its input symbols are transitions-ids, which encode the pdf-id and
+   phones and its input symbols are transition-ids, which encode the pdf-id and
    other information (see \ref transition_model_identifiers)
 
 This is the standard recipe. However, there are a lot of details to be
 filled in. We want to ensure that the output is determinized and minimized,
 and in order for HCLG to be determinizable we have to insert disambiguation
 symbols. For details on the disambiguation symbols, see
 \ref graph_disambig below.
 
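Schematically, and glossing over details that the rest of this page discusses (disambiguation symbols, their later removal, self-loops, and how weights are handled), the construction described above is usually written with determinization and minimization applied at each stage; a rough sketch, in the notation of the Mohri et al. reference, is

\f[ HCLG \;=\; \mathrm{min}\Big(\mathrm{det}\big(H \circ \mathrm{min}(\mathrm{det}(C \circ \mathrm{min}(\mathrm{det}(L \circ G))))\big)\Big). \f]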
 We also want to ensure the HCLG is stochastic, as far as possible; in the
 conventional recipes, this is done (if at all) with the "push-weights"
@@ -72,7 +72,7 @@ namespace fst {
 as long as G was stochastic. Of course, G will typically not be quite
 stochastic, because of the way Arpa language models with backoff are represented in FSTs,
 but at least our approach ensures that the non-stochasticity "stays
 put" and does not get worse than it was at the start; this approach avoids
 the danger of the "push-weights" operation failing or making things worse.
 
 \section graph_disambig Disambiguation symbols
@@ -80,7 +80,7 @@ namespace fst {
 Disambiguation symbols are the symbols \#1, \#2, \#3 and so on that
 are inserted at the end of phoneme sequences in the lexicon. When
 a phoneme sequence is a prefix of another phoneme sequence in the lexicon,
 or appears in more than one word, it needs to have one of
 these symbols added after it. These symbols are needed to ensure that
 the product L o G is determinizable. We also insert disambiguation symbols
 in two more places. We have a symbol \#0 on the backoff arcs in the
@@ -90,22 +90,22 @@ namespace fst {
 at the beginning of the utterance (before we start outputting symbols).
 This is necessary to fix a rather subtle problem that happens when we have
 words with an empty phonetic representation (e.g. the beginning and
 end of sentence symbols \<s\> and \</s\>).
 
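To make the lexicon rule above concrete, namely that a pronunciation needs a disambiguation symbol if it is a proper prefix of another pronunciation or is shared by more than one word, here is a small self-contained sketch. It is not the code the Kaldi recipes actually use for this (they do it with a lexicon-preparation script); the toy lexicon and the exact numbering of the \#k symbols are only illustrative.

\code
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
  // Toy lexicon: (word, pronunciation as a sequence of phones).
  std::vector<std::pair<std::string, std::vector<std::string> > > lexicon = {
      {"a",      {"ax"}},
      {"ate",    {"ey", "t"}},
      {"eight",  {"ey", "t"}},
      {"eighty", {"ey", "t", "iy"}}};

  // Count how many words share each exact pronunciation (homophones) ...
  std::map<std::vector<std::string>, int> num_words_with_pron;
  // ... and record which pronunciations occur as proper prefixes of others.
  std::map<std::vector<std::string>, bool> is_proper_prefix;
  for (const auto &entry : lexicon) {
    const std::vector<std::string> &pron = entry.second;
    num_words_with_pron[pron]++;
    for (size_t len = 1; len < pron.size(); ++len)
      is_proper_prefix[std::vector<std::string>(pron.begin(),
                                                pron.begin() + len)] = true;
  }

  // Apply the rule from the text: homophones get distinct symbols #1, #2, ...;
  // a pronunciation that is a proper prefix of another gets a symbol too.
  std::map<std::vector<std::string>, int> next_suffix;
  for (const auto &entry : lexicon) {
    std::vector<std::string> pron = entry.second;
    if (num_words_with_pron[pron] > 1 || is_proper_prefix.count(pron))
      pron.push_back("#" + std::to_string(++next_suffix[entry.second]));
    std::cout << entry.first;
    for (const std::string &p : pron) std::cout << ' ' << p;
    std::cout << '\n';
  }
  return 0;
}
\endcode

Running this prints, for example, "ate ey t #1" and "eight ey t #2", while "a ax" and "eighty ey t iy" are left alone; in the real lexicon FST these \#k symbols then appear as extra symbols on the phone (input) side.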
 We give the outline of how we would formally prove that the
 intermediate stages of graph compilation (e.g. LG, CLG, HCLG) are determinizable;
 this is important in ensuring that our recipe never fails. By determinizable,
 we mean determinizable after epsilon removal. The general setup is:
 first, we stipulate that G must be determinizable. This is why we need the \#0
 symbols (G is actually deterministic, hence determinizable). Then we want L to
 be such that for any determinizable G, L o G is determinizable. [The same
 goes for C, with L o G on the right instead of G]. There are a lot of details
 of the theory still to be fleshed out, but I believe it is sufficient for L to have two
 properties:
  - \f$ L^{-1} \f$ must be functional
    - equivalently: any input-sequence on L must induce a unique output-sequence
    - equivalently: for any linear acceptor A, A o L is a linear transducer or empty.
  - L has the twins property, i.e. there are no two states reachable
    with the same input-symbol sequence, that each have a cycle with the
    same input sequence but different weight or output sequence.
 
@@ -117,7 +117,7 @@ namespace fst {
 The \ref ContextFst object (C) is a dynamically created
 FST object that represents a transducer from context-dependent phones to
 context-independent phones. The purpose of this object is to avoid us having
 to enumerate all possible phones in context, which could be difficult when
 the context width (N) or the number of phones is larger than normal. The
 constructor ContextFst::ContextFst requires the context-width (N) and
 central-position (P) which are further explained in \ref tree_window; their
@@ -125,7 +125,7 @@ namespace fst {
 It also requires the integer id of a special symbol, the "subsequential
 symbol" (referred to as '$' in the reference given above), which the FST outputs
 N-P-1 times after all phones are seen (this ensures that the context FST
 is output-deterministic). In addition to this it requires lists of the
 integer id's of the phones and the disambiguation symbols. The vocabulary
 on the output side of the ContextFst consists of the set of phones and
 disambiguation symbols plus the subsequential symbol. The vocabulary on the
@@ -135,14 +135,14 @@ namespace fst {
 place of epsilon in the "traditional recipe", but which we treat as any other
 disambiguation symbol (it is necessary to ensure determinizability in the sense
 we prefer, i.e. with epsilon removal). The subsequential symbol '$' has nothing corresponding
 to it on the input side (this is also the case in the traditional recipe).
 The symbol id's corresponding to the disambiguation
 symbols on the input side are not necessarily the same integer identifiers as used on the output
 side for the corresponding symbols.
 The ContextFst object has a function ILabelInfo()
 which returns a reference to an object of type std::vector<std::vector<int32> > which
 enables the user to work out the "meaning" of each symbol on the input side. The
 correct interpretation of this object is described further in \ref tree_ilabel.
 
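As a small, hedged illustration of how one might inspect the ILabelInfo() structure just described: the typedef and helper below are ours, not part of Kaldi, and since the encoding of epsilon, disambiguation symbols and phone windows inside each entry is specified in \ref tree_ilabel rather than here, the sketch only dumps the raw integer windows.

\code
#include <cstdint>
#include <iostream>
#include <vector>

typedef int32_t int32;  // stand-in for Kaldi's int32 typedef

// Print the integer window attached to each input-side label id.
// Intended to be called with the vector returned by ContextFst::ILabelInfo().
void PrintILabelInfo(const std::vector<std::vector<int32> > &ilabel_info) {
  for (size_t ilabel = 0; ilabel < ilabel_info.size(); ++ilabel) {
    std::cout << "input label " << ilabel << " ->";
    for (size_t j = 0; j < ilabel_info[ilabel].size(); ++j)
      std::cout << ' ' << ilabel_info[ilabel][j];
    std::cout << '\n';
  }
}

int main() {
  // Toy contents standing in for a real ILabelInfo() result (N = 3 windows).
  std::vector<std::vector<int32> > toy = { {}, {1, 2, 3}, {2, 3, 4} };
  PrintILabelInfo(toy);
  return 0;
}
\endcode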
 There is a special "Matcher" object called ContextMatcher which is intended to be
 used in composition algorithms involving the ContextFst (a Matcher is something that OpenFst's
@@ -162,14 +162,14 @@ namespace fst {
 We deal with the whole question of weight pushing in a slightly different way from
 the traditional recipe. Weight pushing in the log semiring can be helpful in speeding
 up search (weight pushing is an FST operation that ensures that the arc probabilities
 out of each state "sum to one" in the appropriate sense).
 However, in some cases weight pushing can have bad effects. The issue is that
 statistical language models, when represented as FSTs, generally "add up to more than one"
 because some words are counted twice (directly, and via backoff arcs).
 
 We decided to avoid ever pushing weights, so instead we handle the whole issue in a different
 way. Firstly, a definition: we call a "stochastic" FST one whose weights sum to one, and
 the reader can assume that we are talking about the log semiring here, not the tropical one,
 so that "sum to one" really means sum, and not "take the max".
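Spelled out as a formula (our notation): an FST is stochastic in this sense if, for every state \f$ s \f$ with outgoing arc weights \f$ w(a) \f$ and final weight \f$ w_f(s) \f$ (all of them negated log-probabilities),

\f[ -\log\Big( \sum_{a\,\in\,\mathrm{arcs}(s)} e^{-w(a)} \;+\; e^{-w_f(s)} \Big) \;=\; 0, \f]

which is why a perfectly stochastic FST makes the fstisstochastic program mentioned below report zeros.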
 We ensure that each stage of graph creation has the property that if the previous stage
 was stochastic, then the next stage will be stochastic. That is: if G is stochastic,
@@ -179,12 +179,12 @@ namespace fst {
 stochasticity". Now, it would be quite possible to do this in a very trivial but not-very-useful
 way: for instance, we could just try the push-weights algorithm and if it seems to be failing because,
 say, the original G fst summed up to more than one, then we throw up our hands in horror and
 announce failure. This would not be very helpful.
 
 We want to preserve stochasticity in
 a stronger sense, which is to say: first measure, for G, the min and max over all its states,
 of the sum of the (arc probabilities plus final-prob) out of those states. This is what our
 program "fstisstochastic" does. If G is stochastic, both of these numbers would be one
 (you would actually see zeros from the program, because actually we operate in log space; this is
 what "log semiring" means). We want to preserve stochasticity in the following sense: that this
 min and max never "get worse"; that is, they never get farther away from one. In fact, this
@@ -194,28 +194,28 @@ namespace fst {
 - Determinization
 - Epsilon removal
 - Composition (with particular FSTs on the left)
 There are also one or two minor algorithms that need to preserve stochasticity,
 like adding a subsequential-symbol loop.
 Minimization naturally preserves stochasticity, as long as we don't do any weight pushing
 as part of it (we use our program "fstminimizeencoded" which does minimization without
 weight pushing). Determinization preserves stochasticity
 as long as we do it in the same semiring that we want to preserve stochasticity in (this means
 the log semiring; this is why we use our program fstdeterminizestar with the option
 --determinize-in-log=true). Regarding epsilon removal: firstly, we have
 our own version of epsilon removal "RemoveEpsLocal()" (fstrmepslocal), which doesn't guarantee
 to remove all epsilons but does guarantee to never "blow up". This algorithm is unusual among
 FST algorithms in that, to do what we need it to do and preserve stochasticity, it needs to
 "keep track of" two semirings
 at the same time. That is, if it is to preserve equivalence in the tropical semiring and
 stochasticity in the log semiring, which is what we need in practice,
 it actually has to "know about" both semirings simultaneously. This seems to be an edge case
 where the "semiring" concept somewhat breaks down.
 Composition on the left with the lexicon L, the context FST C and the H transducer (which encodes
 the HMM structure) all have to preserve stochasticity. Let's discuss this in the abstract:
 we ask, when composing A o B, what are sufficient properties that A
 must have so that A o B will be stochastic whenever B is stochastic? We believe these properties are:
 
 - For any symbol sequence that can appear on the input of B, the inverse of A must accept
   that sequence (i.e. it must be possible for A to output that sequence), and:
 - For any such symbol sequence (say, S), if we compose A with an unweighted linear FST with
   S on its input, the result will be stochastic.
@@ -223,14 +223,14 @@ namespace fst {
 These properties are true of C, L and H, at least if everything is properly normalized
 (i.e. if the lexicon weights sum to one for any given word, and if the HMMs in H are properly
 normalized and we don't use a probability scale). However, in practice in our graph creation
 recipes we use a probability scale on the transition probabilities in the HMMs (similar to the
 acoustic scale). This means that the very last stages of graph creation typically don't
 preserve stochasticity. Also, if we are using a statistical language model, G will typically
 not be stochastic in the first place. What we do in this case is we measure at the start
 how much it "deviates from stochasticity" (using the program fstisstochastic), and during
 subsequent graph creation stages (except for the very last one) we verify that the
 non-stochasticity does not "get worse" than it was at the beginning.
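One sufficient way to make the "does not get worse" check concrete (our formulation; the exact comparison used in the scripts may differ): let \f$ p_{\min}(X) \f$ and \f$ p_{\max}(X) \f$ be the minimum and maximum, over the states of an FST \f$ X \f$, of the total outgoing probability mass (arc probabilities plus final-prob) that fstisstochastic reports in negated-log form. Then for each intermediate graph \f$ Y \f$ (LG, CLG, ...) we require

\f[ p_{\min}(G) \;\le\; p_{\min}(Y) \qquad\text{and}\qquad p_{\max}(Y) \;\le\; p_{\max}(G), \f]

so that the interval \f$ [p_{\min}, p_{\max}] \f$, which would be exactly \f$ \{1\} \f$ for a stochastic FST, never widens beyond what G started with.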
 
 
 */
 