author     Daniel Povey  2017-08-21 18:06:11 -0500
committer  Daniel Povey  2017-08-21 18:06:11 -0500
commit     92dad5b6542bdbe0ef876b73cebbd16c391c56bf (patch)
tree       a5bd34103a7a96e6ad810c1491af723b63febd80
parent     91d47c1dc4cee5b28a2c13b1c63bbce8ecc8a54d (diff)
[doc] Add some terms to glossary
-rw-r--r--  src/doc/glossary.dox | 30
-rw-r--r--  src/doc/graph.dox    | 84
2 files changed, 72 insertions(+), 42 deletions(-)
diff --git a/src/doc/glossary.dox b/src/doc/glossary.dox
index 8bb8283e5..e1b3c01f7 100644
--- a/src/doc/glossary.dox
+++ b/src/doc/glossary.dox
@@ -50,11 +50,41 @@ contain alignment information as sequences of transition-ids for each word
 sequence in the lattice. The program \ref bin/show-alignments.cc "show-alignments" shows
 alignments in a human-readable format.
 
+
+<b>:cost:</b> Any quantity which is used as a 'cost' in a weighted FST
+   algorithm (e.g. acoustic cost, language model cost; see \ref lattices
+   for more details). Costs are, generally speaking, interpretable
+   as the negative log of a likelihood or probability, but there may
+   be scaling factors involved.
+
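As a hedged illustration of the "scaling factors" mentioned in this new entry (the symbols below are generic, not the names of any particular Kaldi options): the cost on a decoding-graph or lattice arc is typically a scaled negated log-probability, e.g.

\f[ \mathrm{cost} \;=\; -\,\alpha \,\log p_{\mathrm{acoustic}}(\mathbf{x} \mid s) \;-\; \log P_{\mathrm{graph}}(s), \f]

where \f$ \alpha \f$ plays the role of an acoustic scale and the graph term folds in the language-model and pronunciation contributions.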
 <b>:forced alignment:</b> see <b>alignment</b>.
 
 <b>:lattice:</b> A representation of alternative likely transcriptions of an utterance, together
   with associated alignment and cost information. See \ref lattices.
 
+
+<b>:likelihood:</b> A mathematical concept meaning the value of a function representing
+   the distribution of a continuous value. These values can be greater than one. Often represented
+   in log space (as log-likelihood) because likelihood values of multi-dimensional
+   features can often be too small or too large to fit in standard floating-point precision.
+   With standard cross-entropy-trained neural net systems we obtain "pseudo-likelihoods"
+   by dividing the output probabilities by the priors of the context-dependent states
+   (equivalently, subtracting the log-priors from the log-probabilities).
+
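The pseudo-likelihood trick described in this entry can be written compactly. Writing \f$ j \f$ for a context-dependent state and \f$ \mathbf{x}_t \f$ for the features at frame \f$ t \f$ (notation chosen here just for illustration), the quantity used in place of a log-likelihood is

\f[ \log \tilde{p}(\mathbf{x}_t \mid j) \;=\; \log P(j \mid \mathbf{x}_t) \;-\; \log P(j), \f]

i.e. the network's log-posterior minus the log-prior of the state; this equals the true log-likelihood up to a term that does not depend on \f$ j \f$ and therefore does not affect decoding.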
+<b>:posterior:</b> "Posterior" is shorthand for "posterior probability", which is a very
+   general mathematical concept, generally meaning "the probability of some random
+   variable after seeing the relevant data". In general, posteriors will sum to one.
+   In Kaldi terminology, if you encounter the term "posterior", abbreviated to "post",
+   without further explanation it generally means the per-frame posterior probability of
+   transition-ids. However, these posteriors may be very peaky (i.e. mostly ones and zeros)
+   depending on how you obtained them, e.g. from a lattice or from an alignment.
+   Alignments and lattices can be converted to posteriors over transition-ids (see
+   \ref ali-to-post.cc and \ref lattice-to-post.cc), and lattices to posteriors over
+   lattice arcs (see \ref lattice-arc-post.cc).
+   Posteriors over transition-ids can be converted to posteriors over pdf-ids or over phones;
+   see the tools \ref post-to-pdf-post.cc and \ref post-to-phone-post.cc.
+
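To make "per-frame posterior probability of transition-ids" and "sum to one" concrete, a schematic definition (our notation, not taken from the Kaldi sources) is

\f[ \gamma_t(i) \;=\; P\big(\text{transition-id at frame } t = i \,\big|\, \mathbf{x}_1,\dots,\mathbf{x}_T\big), \qquad \sum_i \gamma_t(i) = 1 \quad \text{for each frame } t. \f]

Posteriors derived from a single alignment are the degenerate ("peaky") case in which \f$ \gamma_t(i) \f$ is 1 for the aligned transition-id and 0 for all others.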
+<b>:pdf-id:</b> The zero-based integer index of a clustered context-dependent HMM state; see
+   \ref transition_model_identifiers for more information.
+
 <b>:transition-id:</b> a one-based index that encodes the pdf-id (i.e. the clustered context-dependent HMM state),
 the phone identity, and information about whether we took the self-loop or forward transition in the HMM.
 Appears in lattices, decoding graphs and alignments. See \ref transition_model.
diff --git a/src/doc/graph.dox b/src/doc/graph.dox
index 849abd3e7..cf0f99acb 100644
--- a/src/doc/graph.dox
+++ b/src/doc/graph.dox
@@ -24,35 +24,35 @@ namespace fst {
 \page graph Decoding graph construction in Kaldi
 
 Firstly, we cannot hope to introduce finite state transducers and how they
 are used in speech recognition. For that, see
 <a href=http://www.cs.nyu.edu/~mohri/pub/hbka.pdf> "Speech Recognition
 with Weighted Finite-State Transducers" </a> by Mohri, Pereira and
 Riley (in Springer Handbook on Speech Processing and Speech Communication, 2008).
 The general approach we use is as described there, but some of the details,
 particularly with respect to how
 we handle disambiguation symbols and how we deal with weight-pushing, differ.
 
 \section graph_overview Overview of graph creation
 
 The overall picture for decoding-graph creation is that we
 are constructing the graph HCLG = H o C o L o G. Here
 
  - G is an acceptor (i.e. its input and output symbols are the same)
    that encodes the grammar or language model.
  - L is the lexicon; its output symbols are words and its input
    symbols are phones.
  - C represents the context-dependency: its output symbols are phones
    and its input symbols represent context-dependent phones, i.e. windows
    of N phones; see \ref tree_window.
  - H contains the HMM definitions; its output symbols represent context-dependent
-   phones and its input symbols are transitions-ids, which encode the pdf-id and
+   phones and its input symbols are transition-ids, which encode the pdf-id and
    other information (see \ref transition_model_identifiers)
 
 This is the standard recipe. However, there are a lot of details to be
 filled in. We want to ensure that the output is determinized and minimized,
 and in order for HCLG to be determinizable we have to insert disambiguation
 symbols. For details on the disambiguation symbols, see
 \ref graph_disambig below.
 
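Schematically, and glossing over details that the rest of this page discusses (disambiguation symbols, their later removal, self-loops, and how weights are handled), the construction described above is usually written with determinization and minimization applied at each stage; a rough sketch, in the notation of the Mohri et al. reference, is

\f[ HCLG \;=\; \mathrm{min}\Big(\mathrm{det}\big(H \circ \mathrm{min}(\mathrm{det}(C \circ \mathrm{min}(\mathrm{det}(L \circ G))))\big)\Big). \f]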
 We also want to ensure the HCLG is stochastic, as far as possible; in the
 conventional recipes, this is done (if at all) with the "push-weights"
@@ -72,7 +72,7 @@ namespace fst {
 as long as G was stochastic. Of course, G will typically not be quite
 stochastic, because of the way Arpa language models with backoff are represented in FSTs,
 but at least our approach ensures that the non-stochasticity "stays
 put" and does not get worse than it was at the start; this approach avoids
 the danger of the "push-weights" operation failing or making things worse.
 
 \section graph_disambig Disambiguation symbols
@@ -80,7 +80,7 @@ namespace fst {
 Disambiguation symbols are the symbols \#1, \#2, \#3 and so on that
 are inserted at the end of phoneme sequences in the lexicon. When
 a phoneme sequence is a prefix of another phoneme sequence in the lexicon,
 or appears in more than one word, it needs to have one of
 these symbols added after it. These symbols are needed to ensure that
 the product L o G is determinizable. We also insert disambiguation symbols
 in two more places. We have a symbol \#0 on the backoff arcs in the
@@ -90,22 +90,22 @@ namespace fst {
 at the beginning of the utterance (before we start outputting symbols).
 This is necessary to fix a rather subtle problem that happens when we have
 words with an empty phonetic representation (e.g. the beginning and
 end of sentence symbols \<s\> and \</s\>).
 
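To make the lexicon rule above concrete, namely that a pronunciation needs a disambiguation symbol if it is a proper prefix of another pronunciation or is shared by more than one word, here is a small self-contained sketch. It is not the code the Kaldi recipes actually use for this (they do it with a lexicon-preparation script); the toy lexicon and the exact numbering of the \#k symbols are only illustrative.

\code
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
  // Toy lexicon: (word, pronunciation as a sequence of phones).
  std::vector<std::pair<std::string, std::vector<std::string> > > lexicon = {
      {"a",      {"ax"}},
      {"ate",    {"ey", "t"}},
      {"eight",  {"ey", "t"}},
      {"eighty", {"ey", "t", "iy"}}};

  // Count how many words share each exact pronunciation (homophones) ...
  std::map<std::vector<std::string>, int> num_words_with_pron;
  // ... and record which pronunciations occur as proper prefixes of others.
  std::map<std::vector<std::string>, bool> is_proper_prefix;
  for (const auto &entry : lexicon) {
    const std::vector<std::string> &pron = entry.second;
    num_words_with_pron[pron]++;
    for (size_t len = 1; len < pron.size(); ++len)
      is_proper_prefix[std::vector<std::string>(pron.begin(),
                                                pron.begin() + len)] = true;
  }

  // Apply the rule from the text: homophones get distinct symbols #1, #2, ...;
  // a pronunciation that is a proper prefix of another gets a symbol too.
  std::map<std::vector<std::string>, int> next_suffix;
  for (const auto &entry : lexicon) {
    std::vector<std::string> pron = entry.second;
    if (num_words_with_pron[pron] > 1 || is_proper_prefix.count(pron))
      pron.push_back("#" + std::to_string(++next_suffix[entry.second]));
    std::cout << entry.first;
    for (const std::string &p : pron) std::cout << ' ' << p;
    std::cout << '\n';
  }
  return 0;
}
\endcode

Running this prints, for example, "ate ey t #1" and "eight ey t #2", while "a ax" and "eighty ey t iy" are left alone; in the real lexicon FST these \#k symbols then appear as extra symbols on the phone (input) side.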
 We give the outline of how we would formally prove that the
 intermediate stages of graph compilation (e.g. LG, CLG, HCLG) are determinizable;
 this is important in ensuring that our recipe never fails. By determinizable,
 we mean determinizable after epsilon removal. The general setup is:
 first, we stipulate that G must be determinizable. This is why we need the \#0
 symbols (G is actually deterministic, hence determinizable). Then we want L to
 be such that for any determinizable G, L o G is determinizable. [The same
 goes for C, with L o G on the right instead of G]. There are a lot of details
 of the theory still to be fleshed out, but I believe it is sufficient for L to have two
 properties:
  - \f$ L^{-1} \f$ must be functional
    - equivalently: any input-sequence on L must induce a unique output-sequence
    - equivalently: for any linear acceptor A, A o L is a linear transducer or empty.
  - L has the twins property, i.e. there are no two states reachable
    with the same input-symbol sequence, that each have a cycle with the
    same input sequence but different weight or output sequence.
 
@@ -117,7 +117,7 @@ namespace fst {
 The \ref ContextFst object (C) is a dynamically created
 FST object that represents a transducer from context-dependent phones to
 context-independent phones. The purpose of this object is to avoid us having
 to enumerate all possible phones in context, which could be difficult when
 the context width (N) or the number of phones is larger than normal. The
 constructor ContextFst::ContextFst requires the context-width (N) and
 central-position (P) which are further explained in \ref tree_window; their
@@ -125,7 +125,7 @@ namespace fst {
 It also requires the integer id of a special symbol, the "subsequential
 symbol" (referred to as '$' in the reference given above), which the FST outputs
 N-P-1 times after all phones are seen (this ensures that the context FST
 is output-deterministic). In addition to this it requires lists of the
 integer id's of the phones and the disambiguation symbols. The vocabulary
 on the output side of the ContextFst consists of the set of phones and
 disambiguation symbols plus the subsequential symbol. The vocabulary on the
@@ -135,14 +135,14 @@ namespace fst {
 place of epsilon in the "traditional recipe", but which we treat as any other
 disambiguation symbol (it is necessary to ensure determinizability in the sense
 we prefer, i.e. with epsilon removal). The subsequential symbol '$' has nothing corresponding
 to it on the input side (this is also the case in the traditional recipe).
 The symbol id's corresponding to the disambiguation
 symbols on the input side are not necessarily the same integer identifiers as used on the output
 side for the corresponding symbols.
 The ContextFst object has a function ILabelInfo()
 which returns a reference to an object of type std::vector<std::vector<int32> > which
 enables the user to work out the "meaning" of each symbol on the input side. The
 correct interpretation of this object is described further in \ref tree_ilabel.
 
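As a small, hedged illustration of how one might inspect the ILabelInfo() structure just described: the typedef and helper below are ours, not part of Kaldi, and since the encoding of epsilon, disambiguation symbols and phone windows inside each entry is specified in \ref tree_ilabel rather than here, the sketch only dumps the raw integer windows.

\code
#include <cstdint>
#include <iostream>
#include <vector>

typedef int32_t int32;  // stand-in for Kaldi's int32 typedef

// Print the integer window attached to each input-side label id.
// Intended to be called with the vector returned by ContextFst::ILabelInfo().
void PrintILabelInfo(const std::vector<std::vector<int32> > &ilabel_info) {
  for (size_t ilabel = 0; ilabel < ilabel_info.size(); ++ilabel) {
    std::cout << "input label " << ilabel << " ->";
    for (size_t j = 0; j < ilabel_info[ilabel].size(); ++j)
      std::cout << ' ' << ilabel_info[ilabel][j];
    std::cout << '\n';
  }
}

int main() {
  // Toy contents standing in for a real ILabelInfo() result (N = 3 windows).
  std::vector<std::vector<int32> > toy = { {}, {1, 2, 3}, {2, 3, 4} };
  PrintILabelInfo(toy);
  return 0;
}
\endcode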
 There is a special "Matcher" object called ContextMatcher which is intended to be
 used in composition algorithms involving the ContextFst (a Matcher is something that OpenFst's
@@ -162,14 +162,14 @@ namespace fst {
 We deal with the whole question of weight pushing in a slightly different way from
 the traditional recipe. Weight pushing in the log semiring can be helpful in speeding
 up search (weight pushing is an FST operation that ensures that the arc probabilities
 out of each state "sum to one" in the appropriate sense).
 However, in some cases weight pushing can have bad effects. The issue is that
 statistical language models, when represented as FSTs, generally "add up to more than one"
 because some words are counted twice (directly, and via backoff arcs).
 
 We decided to avoid ever pushing weights, so instead we handle the whole issue in a different
 way. Firstly, a definition: we call a "stochastic" FST one whose weights sum to one, and
 the reader can assume that we are talking about the log semiring here, not the tropical one,
 so that "sum to one" really means sum, and not "take the max".
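Spelled out as a formula (our notation): an FST is stochastic in this sense if, for every state \f$ s \f$ with outgoing arc weights \f$ w(a) \f$ and final weight \f$ w_f(s) \f$ (all of them negated log-probabilities),

\f[ -\log\Big( \sum_{a\,\in\,\mathrm{arcs}(s)} e^{-w(a)} \;+\; e^{-w_f(s)} \Big) \;=\; 0, \f]

which is why a perfectly stochastic FST makes the fstisstochastic program mentioned below report zeros.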
 We ensure that each stage of graph creation has the property that if the previous stage
 was stochastic, then the next stage will be stochastic. That is: if G is stochastic,
@@ -179,12 +179,12 @@ namespace fst {
 stochasticity". Now, it would be quite possible to do this in a very trivial but not-very-useful
 way: for instance, we could just try the push-weights algorithm and if it seems to be failing because,
 say, the original G fst summed up to more than one, then we throw up our hands in horror and
 announce failure. This would not be very helpful.
 
 We want to preserve stochasticity in
 a stronger sense, which is to say: first measure, for G, the min and max over all its states,
 of the sum of the (arc probabilities plus final-prob) out of those states. This is what our
 program "fstisstochastic" does. If G is stochastic, both of these numbers would be one
 (you would actually see zeros from the program, because actually we operate in log space; this is
 what "log semiring" means). We want to preserve stochasticity in the following sense: that this
 min and max never "get worse"; that is, they never get farther away from one. In fact, this
@@ -194,28 +194,28 @@ namespace fst {
 - Determinization
 - Epsilon removal
 - Composition (with particular FSTs on the left)
 There are also one or two minor algorithms that need to preserve stochasticity,
 like adding a subsequential-symbol loop.
 Minimization naturally preserves stochasticity, as long as we don't do any weight pushing
 as part of it (we use our program "fstminimizeencoded" which does minimization without
 weight pushing). Determinization preserves stochasticity
 as long as we do it in the same semiring that we want to preserve stochasticity in (this means
 the log semiring; this is why we use our program fstdeterminizestar with the option
 --determinize-in-log=true). Regarding epsilon removal: firstly, we have
 our own version of epsilon removal "RemoveEpsLocal()" (fstrmepslocal), which doesn't guarantee
 to remove all epsilons but does guarantee to never "blow up". This algorithm is unusual among
 FST algorithms in that, to do what we need it to do and preserve stochasticity, it needs to
 "keep track of" two semirings
 at the same time. That is, if it is to preserve equivalence in the tropical semiring and
 stochasticity in the log semiring, which is what we need in practice,
 it actually has to "know about" both semirings simultaneously. This seems to be an edge case
 where the "semiring" concept somewhat breaks down.
 Composition on the left with the lexicon L, the context FST C and the H transducer (which encodes
 the HMM structure) all have to preserve stochasticity. Let's discuss this in the abstract:
 we ask, when composing A o B, what are sufficient properties that A
 must have so that A o B will be stochastic whenever B is stochastic? We believe these properties are:
 
 - For any symbol sequence that can appear on the input of B, the inverse of A must accept
   that sequence (i.e. it must be possible for A to output that sequence), and:
 - For any such symbol sequence (say, S), if we compose A with an unweighted linear FST with
   S on its input, the result will be stochastic.
@@ -223,14 +223,14 @@ namespace fst {
 These properties are true of C, L and H, at least if everything is properly normalized
 (i.e. if the lexicon weights sum to one for any given word, and if the HMMs in H are properly
 normalized and we don't use a probability scale). However, in practice in our graph creation
 recipes we use a probability scale on the transition probabilities in the HMMs (similar to the
 acoustic scale). This means that the very last stages of graph creation typically don't
 preserve stochasticity. Also, if we are using a statistical language model, G will typically
 not be stochastic in the first place. What we do in this case is we measure at the start
 how much it "deviates from stochasticity" (using the program fstisstochastic), and during
 subsequent graph creation stages (except for the very last one) we verify that the
 non-stochasticity does not "get worse" than it was at the beginning.
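One sufficient way to make the "does not get worse" check concrete (our formulation; the exact comparison used in the scripts may differ): let \f$ p_{\min}(X) \f$ and \f$ p_{\max}(X) \f$ be the minimum and maximum, over the states of an FST \f$ X \f$, of the total outgoing probability mass (arc probabilities plus final-prob) that fstisstochastic reports in negated-log form. Then for each intermediate graph \f$ Y \f$ (LG, CLG, ...) we require

\f[ p_{\min}(G) \;\le\; p_{\min}(Y) \qquad\text{and}\qquad p_{\max}(Y) \;\le\; p_{\max}(G), \f]

so that the interval \f$ [p_{\min}, p_{\max}] \f$, which would be exactly \f$ \{1\} \f$ for a stochastic FST, never widens beyond what G started with.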
 
 
 */
 