nips nips2004 nips2004-107 knowledge-graph by maker-knowledge-mining

107 nips-2004-Making Latin Manuscripts Searchable using gHMM's

Source: pdf

Author: Jaety Edwards, Yee W. Teh, Roger Bock, Michael Maire, Grace Vesom, David A. Forsyth

Abstract: We describe a method that can make a scanned, handwritten mediaeval latin manuscript accessible to full text search. A generalized HMM is ﬁtted, using transcribed latin to obtain a transition model and one example each of 22 letters to obtain an emission model. We show results for unigram, bigram and trigram models. Our method transcribes 25 pages of a manuscript of Terence with fair accuracy (75% of letters correctly transcribed). Search results are very strong; we use examples of variant spellings to demonstrate that the search respects the ink of the document. Furthermore, our model produces fair searches on a document from which we obtained no training data. 1. Intoduction There are many large corpora of handwritten scanned documents, and their number is growing rapidly. Collections range from the complete works of Mark Twain to thousands of pages of zoological notes spanning two centuries. Large scale analyses of such corpora is currently very difﬁcult, because handwriting recognition works poorly. Recently, Rath and Manmatha have demonstrated that one can use small bodies of aligned material as supervised data to train a word spotting mechanism [7]. The result can make scanned handwritten documents searchable. Current techniques assume a closed vocabulary — one can search only for words in the training set — and search for instances of whole words. This approach is particularly unattractive for an inﬂected language, because individual words can take so many forms that one is unlikely to see all in the training set. Furthermore, one would like the method used to require very little aligned training data, so that it is possible to process documents written by different scribes with little overhead. Mediaeval Latin manuscripts are a natural ﬁrst corpus for studying this problem, because there are many scanned manuscripts and because the handwriting is relatively regular. We expect the primary user need to be search over a large body of documents — to allow comparisons between documents — rather than transcription of a particular document (which is usually relatively easy to do by hand). Desirable features for a system are: First, that it use little or no aligned training data (an ideal, which we believe may be attainable, is an unsupervised learning system). Second, that one can search the document for an arbitrary string (rather than, say, only complete words that appear in the training data). This would allow a user to determine whether a document contains curious or distinctive spellings, for example (ﬁgure 7). We show that, using a statistical model based on a generalized HMM, we can search a medieval manuscript with considerable accuracy, using only one instance each of each letter in the manuscript to train the method (22 instances in total; Latin has no j, k, w, or z). Furthermore, our method allows fairly accurate transcription of the manuscript. We train our system on 22 glyphs taken from a a 12th century latin manuscript of Terence’s Comedies (obtained from a repository of over 80 scanned medieval works maintained by Oxford University [1]). We evaluate searches using a considerable portion of this manuscript aligned by hand; we then show that fair search results are available on a different manuscript (MS. Auct. D. 2. 16, Latin Gospels with beast-headed evangelist portraits made at Landvennec, Brittany, late 9th or early 10th century, from [1]) without change of letter templates. 1.1. Previous Work Handwriting recognition is a traditional problem, too well studied to review in detail here (see [6]). Typically, online handwriting recognition (where strokes can be recorded) works better than ofﬂine handwriting recognition. Handwritten digits can now be recognized with high accuracy [2, 5]. Handwritten amounts can be read with fair accuracy, which is signiﬁcantly improved if one segments the amount into digits at the same time as one recognizes it [4, 5]. Recently several authors have proposed new techniques for search and translation in this unrestricted setting. Manmatha et al [7] introduce the technique of “word spotting,” which segments text into word images, rectiﬁes the word images, and then uses an aligned training set to learn correspondences between rectiﬁed word images and strings. The method is not suitable for a heavily inﬂected language, because words take so many forms. In an inﬂected language, the natural unit to match to is a subset of a word, rather than a whole word, implying that one should segment the text into blocks — which may be smaller than words — while recognizing. Vinciarelli et al [8] introduce a method for line by line recognition based around an HMM and quite similar to techniques used in the speech recognition community. Their method uses a window that slides along the text to obtain features; this has the difﬁculty that the same window is in some places too small (and so uninformative) and in others too big (and so spans more than one letter, and is confusing). Their method requires a substantial body of aligned training data, which makes it impractical for our applications. Close in spirit to our work is the approach to machine translation of Koehn and Knight [3]. They demonstrate that the statistics of unaligned corpora may provide as powerful constraints for training models as aligned bitexts. 2. The Model Our models for both search and transcription are based on the generalized HMM and differ only in their choice of transition model. In an HMM, each hidden node ct emits a single evidence node xt . In a generalized HMM, we allow each ct to emit a series of x’s whose length is itself a random variable. In our model, the hidden nodes correspond to letters and each xt is a single column of pixels. Allowing letters to emit sets of columns lets us accomodate letter templates of variable width. In particular, this means that we can unify segmenting ink into letters and recognizing blocks of ink; ﬁgure 3 shows an example of how useful this is. 2.1. Generating a line of text Our hidden state consists of a character label c, width w and vertical position y. The statespace of c contains the characters ‘a’-‘z’, a space ‘ ’, and a special end state Ω. Let T c be the template associated with character c, Tch , Tcw be respectively the height and width of that template, and m be the height of the image. Figure 1: Left, a full page of our manuscript, a 12’th century manuscript of Terence’s Comedies obtained from [1]. Top right, a set of lines from a page from that document and bottom right, some words in higher resolution. Note: (a) the richness of page layout; (b) the clear spacing of the lines; (c) the relatively regular handwriting. Figure 2: Left, the 22 instances, one per letter, used to train our emission model. These templates are extracted by hand from the Terence document. Right, the ﬁve image channels for a single letter. Beginning at image column 1 (and assuming a dummy space before the ﬁrst character), • • • • choose character c ∼ p(c|c−1...−n ) (an n-gram letter model) choose length w ∼ Uniform(Tcw − k, Tcw + k) (for some small k) choose vertical position y ∼ Uniform(1, m − Tch ) z,y and Tch now deﬁne a bounding box b of pixels. Let i and j be indexed from the top left of that bounding box. – draw pixel (i, j) ∼ N (Tcij , σcij ) for each pixel in b – draw all pixels above and below b from background gaussian N (µ0 , σ0 ) (See 2.2 for greater detail on pixel emission model) • move to column w + 1 and repeat until we enter the end state Ω. Inference on a gHMM is a relatively straighforward business of dynamic programming. We have used unigram, bigram and trigram models, with each transition model ﬁtted using an electronic version of Caesar’s Gallic Wars, obtained from http://www.thelatinlibrary.com. We do not believe that the choice of author should signiﬁcantly affect the ﬁtted transition model — which is at the level of characters — but have not experimented with this point. The important matter is the emission model. 2.2. The Emission Model Our emission model is as follows: Given the character c and width w, we generate a template of the required length. Each pixel in this template becomes the mean of a gaussian which generates the corresponding pixel in the image. This template has a separate mean image for each pixel channel. The channels are assumed independent given the means. We train the model by cutting out by hand a single instance of each letter from our corpus (ﬁgure 2). This forms the central portion of the template. Pixels above and below this Model Perfect transcription unigram bigram trigram matching chars 21019 14603 15572 15788 substitutions 0 5487 4597 4410 insertions 0 534 541 507 deletions 0 773 718 695 Table 1: Edit distance between our transcribed Terence and the editor’s version. Note the trigram model produces signiﬁcantly fewer letter errors than the unigram model, but that the error rate is still a substantial 25%. central box are generated from a single gaussian used to model background pixels (basically white pixels). We add a third variable yt to our hidden state indicating the vertical position of the central box. However, since we are uninterested in actually recovering this variable, during inference we sum it out of the model. The width of a character is constrained to be close to the width (tw ) of our hand cut example by setting p(w|c) = 0 for w < tw − k and w > tw + k. Here k is a small, user deﬁned integer. Within this range, p(w|c) is distributed uniformly, larger templates are created by appending pixels from the background model to the template and smaller ones by simply removing the right k-most columns of the hand cut example. For features, we generate ﬁve image representations, shown in ﬁgure 2. The ﬁrst is a grayscale version of the original color image. The second and third are generated by convolving the grayscale image with a vertical derivative of gaussian ﬁlter, separating the positive and negative components of this response, and smoothing each of these gradient images separately. The fourth and ﬁfth are generated similarly but with a horizontal derivative of gaussian ﬁlter. We have experimented with different weightings of these 5 channels. In practice we use the gray scale channel and the horizontal gradient channels. We emphasize the horizontal pieces since these seem the more discriminative. 2.3. Transcription For transcription, we model letters as coming from an n-gram language model, with no dependencies between words. Thus, the probability of a letter depends on the k letters before it, where k = n unless this would cross a word boundary in which case the history terminates at this boundary. We chose not to model word to word transition probabilities since, unlike in English, word order in Latin is highly arbitrary. This transition model is ﬁt from a corpus of ascii encoded latin. We have experimented with unigram (i.e. uniform transition probabilities), bigram and trigram letter models. We can perform transcription by ﬁtting the maximum likelihood path through any given line. Some results of this technique are shown in ﬁgure 3. 2.4. Search For search, we rank lines by the probability that they contain our search word. We set up a ﬁnite state machine like that in ﬁgure 4. In this ﬁgure, ‘bg’ represents our background model for that portion of the line not generated by our search word. We can use any of the n-gram letter models described for transcription as the transition model for ‘bg’. The probability that the line contains the search word is the probability that this FSM takes path 1. We use this FSM as the transition model for our gHMM, and output the posterior probability of the two arrows leading into the end state. 1 and 2 are user deﬁned weights, but in practice the algorithm does not appear to be particular sensitive to the choice of these parameters. The results presented here use the unigram model. Editorial translation Orator ad vos venio ornatu prologi: unigram b u rt o r a d u o s u em o o r n a t u p r o l o g r b u rt o r a d v o s v em o o r u a t u p r o l o g r fo r a t o r a d v o s v en i o o r n a t u p r o l o g i bigram trigram Figure 3: We transcribe the text by ﬁnding the maximum likelihood path through the gHMM. The top line shows the standard version of the line (obtained by consensus among editors who have consulted various manuscripts; we obtained this information in electronic form from http://www.thelatinlibrary.com). Below, we show the line as segmented and transcribed by unigram, bigram and trigram models; the unigram and bigram models transcribe one word as “vemo”, but the stronger trigram model forces the two letters to be segmented and correctly transcribes the word as “venio”, illustrating the considerable beneﬁt to be obtained by segmenting only at recognition time. 1 − ε1 Path 1 1 − ε2 a b bg ε1 Ω bg Path 2 ε2 Figure 4: The ﬁnite state machine to search for the word ‘ab.’ ‘bg’ is a place holder for the larger ﬁnite state machine deﬁned by our language model’s transition matrix. 3. Results Figure 1 shows a page from our collection. This is a scanned 12th century manuscript of Terence’s Comedies, obtained from the collection at [1]. In preprocessing, we extract individual lines of text by rotating the image to various degrees and projecting the sum of the pixel values onto the y-axis. We choose the orientation whose projection vector has the lowest entropy, and then segment lines by cutting at minima of this projection. Transcription is not our primary task, but methods that produce good transcriptions are going to support good searches. The gHMM can produce a surprisingly good transcription, given how little training data is used to train the emission model. We aligned an editors version of Terence with 25 pages from the manuscript by hand, and computed the edit distance between the transcribed text and the aligned text; as table 1 indicates, approximately 75% of letters are read correctly. Search results are strong. We show results for two documents. The ﬁrst set of results refers to the edition of Terence’s Comedies, from which we took the 22 letter instances. In particular, for any given search term, our process ranks the complete set of lines. We used a hand alignment of the manuscript to determine which lines contained each term; ﬁgure 5 shows an overview of searches performed using every word that appears in the 50 100 150 200 250 300 350 400 450 500 550 Figure 5: Our search ranks 587 manuscript lines, with higher ranking lines more likely to contain the relevant term. This ﬁgure shows complete search results for each term that appears more than three times in the 587 lines. Each row represents the ranked search results for a term, and a black mark appears if the search term is actually in the line; a successful search will therefore appear as a row which is wholly dark to the left, and then wholly light. All 587 lines are represented. More common terms are represented by lower rows. More detailed results appear in ﬁgure 5 and ﬁgure 6; this summary ﬁgure suggests almost all searches are highly successful. document more than three times, in particular, showing which of the ranked set of lines actually contained the search term. For almost every search, the term appears mainly in the lines with higher rank. Figure 6 contains more detailed information for a smaller set of words. We do not score the position of a word in a line (for practical reasons). Figure 7 demonstrates (a) that our search respects the ink of the document and (b) that for the Terence document, word positions are accurately estimated. The spelling of mediaeval documents is typically cleaned up by editors; in our manuscript, the scribe reliably spells “michi” for the standard “mihi”. A search on “michi” produces many instances; a search on “mihi” produces none, because the ink doesn’t have any. Notice this phenomenon also in the bottom right line of ﬁgure 7, the scribe writes “habet, ut consumat nunc cum nichil obsint doli” and the editor gives “habet, ut consumat nunc quom nil obsint doli.” Figure 8 shows that searches on short strings produce many words containing that string as one would wish. 4. Discussion We have shown that it is possible to make at least some handwritten mediaeval manuscripts accessible to full text search, without requiring an aligned text or much supervisory data. Our documents have very regular letters, and letter frequencies — which can be obtained from transcribed Latin — appear to provide so powerful a cue that relatively little detailed information about letter shapes is required. Linking letter segmentation and recognition has thoroughly beneﬁcial effects. This suggests that the pool of manuscripts that can be made accessible in this way is large. In particular, we have used our method, trained on 22 instances of letters from one document, to search another document. Figure 9 shows the results from two searches of our second document (MS. Auct. D. 2. 16, Latin Gospels with beast-headed evangelist portraits made at Landvennec, Brittany, late 9th or early 10th century, from [1]). No information from this document was used in training at all; but letter 1tu arbitror pater etiam nisi factum primum siet vero illi inter hic michi ibi qui tu ibi michi 0.9 0.8 0.7 qui hic 0.6 inter 0.5 illi 0.4 siet 0.3 vero 0.2 nisi 0.1 50 100 150 200 250 300 350 400 450 500 550 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 6: On the left, search results for selected words (indicated on the leftmost column). Each row represents the ranked search results for a term, and a black mark appears if the search term is actually in the line; a successful search will therefore appear as a row which is wholly dark to the left, and then wholly light. Note only the top 300 results are represented, and that lines containing the search term are almost always at or close to the top of the search results (black marks to the left). On the right, we plot precision against recall for a set of different words by taking the top 10, 20, ... lines returned from the search, and checking them against the aligned manuscript. Note that, once all cases have been found, if the size of the pool is increased the precision will fall with 100% recall; many words work well, with most of the ﬁrst 20 or so lines returned containing the search term. shapes are sufﬁciently well shared that the search is still useful. All this suggests that one might be able to use EM to link three processes: one that clusters to determine letter shapes; one that segments letters; and one that imposes a language model. Such a system might be able to make handwritten Latin searchable with no training data. References [1] Early Manuscripts at Oxford University. Bodleian library ms. auct. f. 2.13. http://image.ox.ac.uk/. [2] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape matching and object recognition using shape contexts. IEEE T. Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002. [3] Philipp Koehn and Kevin Knight. Estimating word translation probabilities from unrelated monolingual corpora. In Proc. of the 17th National Conf. on AI, pages 711–715. AAAI Press / The MIT Press, 2000. [4] Y. LeCun, L. Bottou, and Y. Bengio. Reading checks with graph transformer networks. In International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 151–154, Munich, 1997. IEEE. [5] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [6] R. Plamondon and S.N. Srihari. Online and off-line handwriting recognition: a comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):63–84, 2000. [7] T. M. Rath and R. Manmatha. Word image matching using dynamic time warping. In Proc. of the Conf. on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 521–527, 2003. [8] Alessandro Vinciarelli, Samy Bengio, and Horst Bunke. Ofﬂine recognition of unconstrained handwritten texts using hmms and statistical language models. IEEE Trans. Pattern Anal. Mach. Intell., 26(6):709–720, 2004. michi: Spe incerta certum mihi laborem sustuli, mihi: Faciuntne intellegendo ut nil intellegant? michi: Nonnumquam conlacrumabat. placuit tum id mihi. mihi: Placuit: despondi. hic nuptiis dictust dies. michi: Sto exspectans siquid mi imperent. venit una,

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Grace Vesom Department of Computer Science UC Berkeley Berkeley, CA 94720 Abstract We describe a method that can make a scanned, handwritten mediaeval latin manuscript accessible to full text search. [sent-3, score-0.749]

2 A generalized HMM is ﬁtted, using transcribed latin to obtain a transition model and one example each of 22 letters to obtain an emission model. [sent-4, score-0.645]

3 We show results for unigram, bigram and trigram models. [sent-5, score-0.252]

4 Our method transcribes 25 pages of a manuscript of Terence with fair accuracy (75% of letters correctly transcribed). [sent-6, score-0.41]

5 Search results are very strong; we use examples of variant spellings to demonstrate that the search respects the ink of the document. [sent-7, score-0.406]

6 Furthermore, our model produces fair searches on a document from which we obtained no training data. [sent-8, score-0.288]

7 Intoduction There are many large corpora of handwritten scanned documents, and their number is growing rapidly. [sent-10, score-0.243]

8 Large scale analyses of such corpora is currently very difﬁcult, because handwriting recognition works poorly. [sent-12, score-0.21]

9 Recently, Rath and Manmatha have demonstrated that one can use small bodies of aligned material as supervised data to train a word spotting mechanism [7]. [sent-13, score-0.378]

10 Current techniques assume a closed vocabulary — one can search only for words in the training set — and search for instances of whole words. [sent-15, score-0.545]

11 Furthermore, one would like the method used to require very little aligned training data, so that it is possible to process documents written by different scribes with little overhead. [sent-17, score-0.219]

12 Mediaeval Latin manuscripts are a natural ﬁrst corpus for studying this problem, because there are many scanned manuscripts and because the handwriting is relatively regular. [sent-18, score-0.524]

13 We expect the primary user need to be search over a large body of documents — to allow comparisons between documents — rather than transcription of a particular document (which is usually relatively easy to do by hand). [sent-19, score-0.74]

14 Desirable features for a system are: First, that it use little or no aligned training data (an ideal, which we believe may be attainable, is an unsupervised learning system). [sent-20, score-0.106]

15 Second, that one can search the document for an arbitrary string (rather than, say, only complete words that appear in the training data). [sent-21, score-0.469]

16 This would allow a user to determine whether a document contains curious or distinctive spellings, for example (ﬁgure 7). [sent-22, score-0.182]

17 We show that, using a statistical model based on a generalized HMM, we can search a medieval manuscript with considerable accuracy, using only one instance each of each letter in the manuscript to train the method (22 instances in total; Latin has no j, k, w, or z). [sent-23, score-0.987]

18 Furthermore, our method allows fairly accurate transcription of the manuscript. [sent-24, score-0.191]

19 We train our system on 22 glyphs taken from a a 12th century latin manuscript of Terence’s Comedies (obtained from a repository of over 80 scanned medieval works maintained by Oxford University [1]). [sent-25, score-0.735]

20 We evaluate searches using a considerable portion of this manuscript aligned by hand; we then show that fair search results are available on a different manuscript (MS. [sent-26, score-0.915]

21 16, Latin Gospels with beast-headed evangelist portraits made at Landvennec, Brittany, late 9th or early 10th century, from [1]) without change of letter templates. [sent-30, score-0.38]

22 Typically, online handwriting recognition (where strokes can be recorded) works better than ofﬂine handwriting recognition. [sent-34, score-0.284]

23 Recently several authors have proposed new techniques for search and translation in this unrestricted setting. [sent-37, score-0.265]

24 Manmatha et al [7] introduce the technique of “word spotting,” which segments text into word images, rectiﬁes the word images, and then uses an aligned training set to learn correspondences between rectiﬁed word images and strings. [sent-38, score-0.751]

25 In an inﬂected language, the natural unit to match to is a subset of a word, rather than a whole word, implying that one should segment the text into blocks — which may be smaller than words — while recognizing. [sent-40, score-0.123]

26 Vinciarelli et al [8] introduce a method for line by line recognition based around an HMM and quite similar to techniques used in the speech recognition community. [sent-41, score-0.214]

27 Their method requires a substantial body of aligned training data, which makes it impractical for our applications. [sent-43, score-0.106]

28 They demonstrate that the statistics of unaligned corpora may provide as powerful constraints for training models as aligned bitexts. [sent-45, score-0.146]

29 The Model Our models for both search and transcription are based on the generalized HMM and differ only in their choice of transition model. [sent-47, score-0.485]

30 In our model, the hidden nodes correspond to letters and each xt is a single column of pixels. [sent-50, score-0.106]

31 Allowing letters to emit sets of columns lets us accomodate letter templates of variable width. [sent-51, score-0.386]

32 In particular, this means that we can unify segmenting ink into letters and recognizing blocks of ink; ﬁgure 3 shows an example of how useful this is. [sent-52, score-0.239]

33 Generating a line of text Our hidden state consists of a character label c, width w and vertical position y. [sent-55, score-0.242]

34 Let T c be the template associated with character c, Tch , Tcw be respectively the height and width of that template, and m be the height of the image. [sent-57, score-0.185]

35 Figure 1: Left, a full page of our manuscript, a 12’th century manuscript of Terence’s Comedies obtained from [1]. [sent-58, score-0.367]

36 Top right, a set of lines from a page from that document and bottom right, some words in higher resolution. [sent-59, score-0.34]

37 Figure 2: Left, the 22 instances, one per letter, used to train our emission model. [sent-61, score-0.15]

38 −n ) (an n-gram letter model) choose length w ∼ Uniform(Tcw − k, Tcw + k) (for some small k) choose vertical position y ∼ Uniform(1, m − Tch ) z,y and Tch now deﬁne a bounding box b of pixels. [sent-67, score-0.205]

39 2 for greater detail on pixel emission model) • move to column w + 1 and repeat until we enter the end state Ω. [sent-70, score-0.161]

40 We have used unigram, bigram and trigram models, with each transition model ﬁtted using an electronic version of Caesar’s Gallic Wars, obtained from http://www. [sent-72, score-0.317]

41 We do not believe that the choice of author should signiﬁcantly affect the ﬁtted transition model — which is at the level of characters — but have not experimented with this point. [sent-75, score-0.109]

42 The Emission Model Our emission model is as follows: Given the character c and width w, we generate a template of the required length. [sent-79, score-0.298]

43 Each pixel in this template becomes the mean of a gaussian which generates the corresponding pixel in the image. [sent-80, score-0.162]

44 This template has a separate mean image for each pixel channel. [sent-81, score-0.114]

45 We train the model by cutting out by hand a single instance of each letter from our corpus (ﬁgure 2). [sent-83, score-0.28]

46 Pixels above and below this Model Perfect transcription unigram bigram trigram matching chars 21019 14603 15572 15788 substitutions 0 5487 4597 4410 insertions 0 534 541 507 deletions 0 773 718 695 Table 1: Edit distance between our transcribed Terence and the editor’s version. [sent-85, score-0.747]

47 Note the trigram model produces signiﬁcantly fewer letter errors than the unigram model, but that the error rate is still a substantial 25%. [sent-86, score-0.515]

48 The width of a character is constrained to be close to the width (tw ) of our hand cut example by setting p(w|c) = 0 for w < tw − k and w > tw + k. [sent-90, score-0.295]

49 Transcription For transcription, we model letters as coming from an n-gram language model, with no dependencies between words. [sent-102, score-0.163]

50 Thus, the probability of a letter depends on the k letters before it, where k = n unless this would cross a word boundary in which case the history terminates at this boundary. [sent-103, score-0.502]

51 We chose not to model word to word transition probabilities since, unlike in English, word order in Latin is highly arbitrary. [sent-104, score-0.638]

52 We can perform transcription by ﬁtting the maximum likelihood path through any given line. [sent-109, score-0.191]

53 Search For search, we rank lines by the probability that they contain our search word. [sent-113, score-0.329]

54 In this ﬁgure, ‘bg’ represents our background model for that portion of the line not generated by our search word. [sent-115, score-0.28]

55 We can use any of the n-gram letter models described for transcription as the transition model for ‘bg’. [sent-116, score-0.461]

56 The probability that the line contains the search word is the probability that this FSM takes path 1. [sent-117, score-0.471]

57 1 − ε1 Path 1 1 − ε2 a b bg ε1 Ω bg Path 2 ε2 Figure 4: The ﬁnite state machine to search for the word ‘ab. [sent-126, score-0.61]

58 ’ ‘bg’ is a place holder for the larger ﬁnite state machine deﬁned by our language model’s transition matrix. [sent-127, score-0.122]

59 This is a scanned 12th century manuscript of Terence’s Comedies, obtained from the collection at [1]. [sent-130, score-0.426]

60 In preprocessing, we extract individual lines of text by rotating the image to various degrees and projecting the sum of the pixel values onto the y-axis. [sent-131, score-0.22]

61 We choose the orientation whose projection vector has the lowest entropy, and then segment lines by cutting at minima of this projection. [sent-132, score-0.138]

62 The gHMM can produce a surprisingly good transcription, given how little training data is used to train the emission model. [sent-134, score-0.15]

63 We aligned an editors version of Terence with 25 pages from the manuscript by hand, and computed the edit distance between the transcribed text and the aligned text; as table 1 indicates, approximately 75% of letters are read correctly. [sent-135, score-0.741]

64 The ﬁrst set of results refers to the edition of Terence’s Comedies, from which we took the 22 letter instances. [sent-138, score-0.205]

65 In particular, for any given search term, our process ranks the complete set of lines. [sent-139, score-0.229]

66 This ﬁgure shows complete search results for each term that appears more than three times in the 587 lines. [sent-141, score-0.229]

67 Each row represents the ranked search results for a term, and a black mark appears if the search term is actually in the line; a successful search will therefore appear as a row which is wholly dark to the left, and then wholly light. [sent-142, score-0.863]

68 document more than three times, in particular, showing which of the ranked set of lines actually contained the search term. [sent-146, score-0.473]

69 We do not score the position of a word in a line (for practical reasons). [sent-149, score-0.242]

70 Figure 7 demonstrates (a) that our search respects the ink of the document and (b) that for the Terence document, word positions are accurately estimated. [sent-150, score-0.697]

71 The spelling of mediaeval documents is typically cleaned up by editors; in our manuscript, the scribe reliably spells “michi” for the standard “mihi”. [sent-151, score-0.201]

72 A search on “michi” produces many instances; a search on “mihi” produces none, because the ink doesn’t have any. [sent-152, score-0.591]

73 Notice this phenomenon also in the bottom right line of ﬁgure 7, the scribe writes “habet, ut consumat nunc cum nichil obsint doli” and the editor gives “habet, ut consumat nunc quom nil obsint doli. [sent-153, score-0.697]

74 ” Figure 8 shows that searches on short strings produce many words containing that string as one would wish. [sent-154, score-0.192]

75 Discussion We have shown that it is possible to make at least some handwritten mediaeval manuscripts accessible to full text search, without requiring an aligned text or much supervisory data. [sent-156, score-0.634]

76 Our documents have very regular letters, and letter frequencies — which can be obtained from transcribed Latin — appear to provide so powerful a cue that relatively little detailed information about letter shapes is required. [sent-157, score-0.612]

77 Linking letter segmentation and recognition has thoroughly beneﬁcial effects. [sent-158, score-0.261]

78 This suggests that the pool of manuscripts that can be made accessible in this way is large. [sent-159, score-0.197]

79 In particular, we have used our method, trained on 22 instances of letters from one document, to search another document. [sent-160, score-0.371]

80 Figure 9 shows the results from two searches of our second document (MS. [sent-161, score-0.24]

81 16, Latin Gospels with beast-headed evangelist portraits made at Landvennec, Brittany, late 9th or early 10th century, from [1]). [sent-165, score-0.175]

82 No information from this document was used in training at all; but letter 1tu arbitror pater etiam nisi factum primum siet vero illi inter hic michi ibi qui tu ibi michi 0. [sent-166, score-1.394]

83 9 1 Figure 6: On the left, search results for selected words (indicated on the leftmost column). [sent-184, score-0.28]

84 Each row represents the ranked search results for a term, and a black mark appears if the search term is actually in the line; a successful search will therefore appear as a row which is wholly dark to the left, and then wholly light. [sent-185, score-0.863]

85 Note only the top 300 results are represented, and that lines containing the search term are almost always at or close to the top of the search results (black marks to the left). [sent-186, score-0.558]

86 lines returned from the search, and checking them against the aligned manuscript. [sent-190, score-0.286]

87 Note that, once all cases have been found, if the size of the pool is increased the precision will fall with 100% recall; many words work well, with most of the ﬁrst 20 or so lines returned containing the search term. [sent-191, score-0.46]

88 shapes are sufﬁciently well shared that the search is still useful. [sent-192, score-0.229]

89 All this suggests that one might be able to use EM to link three processes: one that clusters to determine letter shapes; one that segments letters; and one that imposes a language model. [sent-193, score-0.262]

90 Such a system might be able to make handwritten Latin searchable with no training data. [sent-194, score-0.143]

91 Ofﬂine recognition of unconstrained handwritten texts using hmms and statistical language models. [sent-244, score-0.212]

92 michi: Spe incerta certum mihi laborem sustuli, mihi: Faciuntne intellegendo ut nil intellegant? [sent-250, score-0.311]

93 hic furti se adligat: mihi: Habet, ut consumat nunc quom nil obsint doli; michi: Quando nec gnatu’ neque hic mi quicquam obtemperant, Figure 7: The handwritten text does not fully correspond to the transcribed version; for example, scribes commonly write “michi” for the standard “mihi”. [sent-261, score-0.922]

94 Our search process reﬂects the ink fairly faithfully, however. [sent-262, score-0.362]

95 Left the ﬁrst four lines returned for a search on the string “michi”; right the ﬁrst four lines returned for a search on the string “mihi”, which does not appear in the document. [sent-263, score-0.908]

96 Note that our search process can offer scholars access to the ink in a particular document, useful for studying variations in transcription, etc. [sent-264, score-0.362]

97 tu: Quae ibi aderant forte unam aspicio adulescentulam Figure 8: Searches on short strings produce substrings of words as well as words (we show the ﬁrst two lines returned from a search for “tu”). [sent-266, score-0.577]

98 sunt interrogaverunt Figure 9: The ﬁrst six lines returned from the second manuscript, (MS. [sent-267, score-0.224]

99 We do not have aligned text, so cannot measure the recall and precision for searches on this document. [sent-272, score-0.202]

100 The recall and precision are clearly not as good as those for the Terence document, the search is reasonably satisfactory, given that no training information from this document was available. [sent-273, score-0.373]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('search', 0.229), ('latin', 0.228), ('michi', 0.219), ('terence', 0.219), ('manuscript', 0.218), ('letter', 0.205), ('mihi', 0.197), ('transcription', 0.191), ('word', 0.191), ('unigram', 0.171), ('manuscripts', 0.153), ('document', 0.144), ('trigram', 0.139), ('ink', 0.133), ('transcribed', 0.133), ('tu', 0.114), ('handwriting', 0.114), ('emission', 0.113), ('bigram', 0.113), ('hic', 0.109), ('letters', 0.106), ('aligned', 0.106), ('century', 0.104), ('scanned', 0.104), ('lines', 0.1), ('handwritten', 0.099), ('searches', 0.096), ('bg', 0.095), ('comedies', 0.088), ('ghmm', 0.088), ('mediaeval', 0.088), ('wholly', 0.088), ('returned', 0.08), ('hmm', 0.075), ('character', 0.075), ('gure', 0.073), ('text', 0.072), ('documents', 0.069), ('template', 0.066), ('brittany', 0.066), ('consumat', 0.066), ('evangelist', 0.066), ('gospels', 0.066), ('habet', 0.066), ('ibi', 0.066), ('landvennec', 0.066), ('nil', 0.066), ('nunc', 0.066), ('obsint', 0.066), ('tch', 0.066), ('tcw', 0.066), ('tw', 0.066), ('transition', 0.065), ('language', 0.057), ('portraits', 0.057), ('recognition', 0.056), ('late', 0.052), ('words', 0.051), ('line', 0.051), ('ut', 0.048), ('fair', 0.048), ('pixel', 0.048), ('string', 0.045), ('page', 0.045), ('width', 0.044), ('doli', 0.044), ('fsm', 0.044), ('illi', 0.044), ('koehn', 0.044), ('manmatha', 0.044), ('medieval', 0.044), ('nisi', 0.044), ('placuit', 0.044), ('quom', 0.044), ('rath', 0.044), ('scribe', 0.044), ('scribes', 0.044), ('searchable', 0.044), ('siet', 0.044), ('spellings', 0.044), ('spotting', 0.044), ('sunt', 0.044), ('venio', 0.044), ('vero', 0.044), ('vinciarelli', 0.044), ('experimented', 0.044), ('accessible', 0.044), ('corpora', 0.04), ('user', 0.038), ('qui', 0.038), ('cutting', 0.038), ('emit', 0.038), ('inter', 0.038), ('transcribe', 0.038), ('transcribes', 0.038), ('ected', 0.037), ('templates', 0.037), ('train', 0.037), ('instances', 0.036), ('translation', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's

Author: Jaety Edwards, Yee W. Teh, Roger Bock, Michael Maire, Grace Vesom, David A. Forsyth

2 0.12409593 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

Author: Michael Fink

Abstract: We describe a framework for learning an object classiﬁer from a single example. This goal is achieved by emphasizing the relevant dimensions for classiﬁcation using available examples of related classes. Learning to accurately classify objects from a single training example is often unfeasible due to overﬁtting effects. However, if the instance representation provides that the distance between each two instances of the same class is smaller than the distance between any two instances from different classes, then a nearest neighbor classiﬁer could achieve perfect performance with a single training example. We therefore suggest a two stage strategy. First, learn a metric over the instances that achieves the distance criterion mentioned above, from available examples of other related classes. Then, using the single examples, deﬁne a nearest neighbor classiﬁer where distance is evaluated by the learned class relevance metric. Finding a metric that emphasizes the relevant dimensions for classiﬁcation might not be possible when restricted to linear projections. We therefore make use of a kernel based metric learning algorithm. Our setting encodes object instances as sets of locality based descriptors and adopts an appropriate image kernel for the class relevance metric learning. The proposed framework for learning from a single example is demonstrated in a synthetic setting and on a character classiﬁcation task. 1

3 0.10284807 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously ﬁnd syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classiﬁcation with models that exclusively use short- and long-range dependencies respectively. 1

4 0.10118555 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with signiﬁcantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1

5 0.088857204 200 nips-2004-Using Random Forests in the Structured Language Model

Author: Peng Xu, Frederick Jelinek

Abstract: In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition. RFs, which were originally developed as classiﬁers, are a combination of decision tree classiﬁers. Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and a random selection of possible questions at each node of the decision tree. Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling. RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing. 1

6 0.087558657 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

7 0.078361191 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

8 0.076326936 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

9 0.067370035 54 nips-2004-Distributed Information Regularization on Graphs

10 0.059052192 124 nips-2004-Multiple Alignment of Continuous Time Series

11 0.057485513 192 nips-2004-The power of feature clustering: An application to object detection

12 0.055341657 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

13 0.054304514 91 nips-2004-Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters

14 0.054303367 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

15 0.051367968 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

16 0.046545036 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

17 0.045366198 40 nips-2004-Common-Frame Model for Object Recognition

18 0.045322411 145 nips-2004-Parametric Embedding for Class Visualization

19 0.044584438 205 nips-2004-Who's In the Picture

20 0.043135595 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.154), (1, 0.023), (2, 0.0), (3, -0.107), (4, 0.049), (5, 0.08), (6, 0.007), (7, 0.065), (8, 0.083), (9, -0.015), (10, -0.014), (11, -0.012), (12, -0.171), (13, 0.059), (14, -0.058), (15, -0.022), (16, 0.16), (17, -0.06), (18, 0.002), (19, -0.017), (20, -0.039), (21, -0.045), (22, -0.031), (23, -0.072), (24, 0.021), (25, -0.065), (26, 0.054), (27, 0.035), (28, 0.059), (29, 0.095), (30, -0.02), (31, -0.032), (32, -0.138), (33, -0.09), (34, 0.08), (35, -0.088), (36, -0.048), (37, 0.095), (38, -0.058), (39, 0.055), (40, -0.007), (41, -0.026), (42, -0.127), (43, 0.001), (44, 0.142), (45, 0.022), (46, 0.038), (47, -0.037), (48, -0.079), (49, -0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95687968 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's

Author: Jaety Edwards, Yee W. Teh, Roger Bock, Michael Maire, Grace Vesom, David A. Forsyth

2 0.61993855 87 nips-2004-Integrating Topics and Syntax

Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum

3 0.52875209 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

4 0.51307255 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang

Abstract: In this paper we propose a probabilistic model for online document clustering. We use non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichletmultinomial distribution. We use empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature. 1

5 0.51122451 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval

Author: Giorgio Gia\-cin\-to, Fabio Roli

Abstract: High retrieval precision in content-based image retrieval can be attained by adopting relevance feedback mechanisms. These mechanisms require that the user judges the quality of the results of the query by marking all the retrieved images as being either relevant or not. Then, the search engine exploits this information to adapt the search to better meet user’s needs. At present, the vast majority of proposed relevance feedback mechanisms are formulated in terms of search model that has to be optimized. Such an optimization involves the modification of some search parameters so that the nearest neighbor of the query vector contains the largest number of relevant images. In this paper, a different approach to relevance feedback is proposed. After the user provides the first feedback, following retrievals are not based on knn search, but on the computation of a relevance score for each image of the database. This score is computed as a function of two distances, namely the distance from the nearest non-relevant image and the distance from the nearest relevant one. Images are then ranked according to this score and the top k images are displayed. Reported results on three image data sets show that the proposed mechanism outperforms other state-of-the-art relevance feedback mechanisms. 1 In t rod u ct i on A large number of content-based image retrieval (CBIR) systems rely on the vector representation of images in a multidimensional feature space representing low-level image characteristics, e.g., color, texture, shape, etc. [1]. Content-based queries are often expressed by visual examples in order to retrieve from the database the images that are “similar” to the examples. This kind of retrieval is often referred to as K nearest-neighbor retrieval. It is easy to see that the effectiveness of content-based image retrieval systems (CBIR) strongly depends on the choice of the set of visual features, on the choice of the “metric” used to model the user’s perception of image similarity, and on the choice of the image used to query the database [1]. Typically, if we allow different users to mark the images retrieved with a given query as relevant or non-relevant, different subsets of images will be marked as relevant. Accordingly, the need for mechanisms to adapt the CBIR system response based on some feedback from the user is widely recognized. It is interesting to note that while relevance feedback mechanisms have been first introduced in the information retrieval field [2], they are receiving more attention in the CBIR field (Huang). The vast majority of relevance feedback techniques proposed in the literature is based on modifying the values of the search parameters as to better represent the concept the user bears in mind. To this end, search parameters are computed as a function of the relevance values assigned by the user to all the images retrieved so far. As an example, relevance feedback is often formulated in terms of the modification of the query vector, and/or in terms of adaptive similarity metrics. [3]-[7]. Recently, pattern classification paradigms such as SVMs have been proposed [8]. Feedback is thus used to model the concept of relevant images and adjust the search consequently. Concept modeling may be difficult on account of the distribution of relevant images in the selected feature space. “Narrow domain” image databases allows extracting good features, so that images bearing similar concepts belong to compact clusters. On the other hand, “broad domain” databases, such as image collection used by graphic professionals, or those made up of images from the Internet, are more difficult to subdivide in cluster because of the high variability of concepts [1]. In these cases, it is worth extracting only low level, non-specialized features, and image retrieval is better formulated in terms of a search problem rather then concept modeling. The present paper aims at offering an original contribution in this direction. Rather then modeling the concept of “relevance” the user bears in mind, feedback is used to assign each image of the database a relevance score. Such a score depends only from two dissimilarities (distances) computed against the images already marked by the user: the dissimilarity from the set of relevant images, and the dissimilarity from the set of non-relevant images. Despite its computational simplicity, this mechanism allows outperforming state-of-the-art relevance feedback mechanisms both on “narrow domain” databases, and on “broad domain” databases. This paper is organized as follows. Section 2 illustrates the idea behind the proposed mechanism and provides the basic assumptions. Section 3 details the proposed relevance feedback mechanism. Results on three image data sets are presented in Section 4, where performances of other relevance feedback mechanisms are compared. Conclusions are drawn in Section 5. 2 In st an ce- b ased rel evan ce est i m at i on The proposed mechanism has been inspired by classification techniques based on the “nearest case” [9]-[10]. Nearest-case theory provided the mechanism to compute the dissimilarity of each image from the sets of relevant and non–relevant images. The ratio between the nearest relevant image and the nearest non-relevant image has been used to compute the degree of relevance of each image of the database [11]. The present section illustrates the rationale behind the use of the nearest-case paradigm. Let us assume that each image of the database has been represented by a number of low-level features, and that a (dis)similarity measure has been defined so that the proximity between pairs of images represents some kind of “conceptual” similarity. In other words, the chosen feature space and similarity metric is meaningful at least for a restricted number of users. A search in image databases is usually performed by retrieving the k most similar images with respect to a given query. The dimension of k is usually small, to avoid displaying a large number of images at a time. Typical values for k are between 10 and 20. However, as the “relevant” images that the user wishes to retrieve may not fit perfectly with the similarity metric designed for the search engine, the user may be interested in exploring other regions of the feature space. To this end, the user marks the subset of “relevant” images out of the k retrieved. Usually, such relevance feedback is used to perform a new k-nn search by modifying some search parameters, i.e., the position of the query point, the similarity metric, and other tuning parameters [1]-[7]. Recent works proposed the use of support vector machine to learn the distribution of relevant images [8]. These techniques require some assumption about the general form of the distribution of relevant images in the feature space. As it is difficult to make any assumption about such a distribution for broad domain databases, we propose to exploit the information about the relevance of the images retrieved so far in a nearest-neighbor fashion. Nearest-neighbor techniques, as used in statistical pattern recognition, case-based reasoning, or instance-based learning, are effective in all applications where it is difficult to produce a high-level generalization of a “class” of objects [9]-[10],[12][13]. Relevance learning in content base image retrieval may well fit into this definition, as it is difficult to provide a general model that can be adapted to represent different concepts of similarity. In addition, the number of available cases may be too small to estimate the optimal set of parameters for such a general model. On the other hand, it can be more effective to use each “relevant” image as well as each “non-relevant” image, as “cases” or “instances” against which the images of the database should be compared. Consequently, we assume that an image is as much as relevant as much as its dissimilarity from the nearest relevant image is small. Analogously, an image is as much as non-relevant as much as its dissimilarity from the nearest non-relevant image is small. 3 Rel evan ce S core Com p u t ati on According to previous section, each image of the database can be thus characterized by a “degree of relevance” and a “degree of non-relevance” according to the dissimilarities from the nearest relevant image, and from the nearest non-relevant image, respectively. However, it should be noted that these degrees should be treated differently because only “relevant” images represent a “concept” in the user’s mind, while “non-relevant” images may represent a number of other concepts different from user’s interest. In other words, while it is meaningful to treat the degree of relevance as a degree of membership to the class of relevant images, the same does not apply to the degree of non-relevance. For this reason, we propose to use the “degree of non-relevance” to weight the “degree of relevance”. Let us denote with R the subset of indexes j ∈ {1,...,k} related to the set of relevant images retrieved so far and the original query (that is relevant by default), and with NR the subset of indexes j ∈ (1,...,k} related to the set of non-relevant images retrieved so far. For each image I of the database, according to the nearest neighbor rule, let us compute the dissimilarity from the nearest image in R and the dissimilarity from the nearest image in NR. Let us denote these dissimilarities as dR(I) and dNR(I), respectively. The value of dR(I) can be clearly used to measure the degree of relevance of image I, assuming that small values of dR(I) are related to very relevant images. On the other hand, the hypothesis that image I is relevant to the user’s query can be supported by a high value of dNR(I). Accordingly, we defined the relevance score ! dR ( I ) $ relevance ( I ) = # 1 + dN ( I ) &

6 0.50282794 200 nips-2004-Using Random Forests in the Structured Language Model

7 0.48879266 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

8 0.46170619 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics

9 0.46159977 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)

10 0.43010059 193 nips-2004-Theories of Access Consciousness

11 0.42047226 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

12 0.42024535 74 nips-2004-Harmonising Chorales by Probabilistic Inference

13 0.37604973 6 nips-2004-A Hidden Markov Model for de Novo Peptide Sequencing

14 0.36309031 205 nips-2004-Who's In the Picture

15 0.36132884 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem

16 0.34833905 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data

17 0.31212655 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

18 0.30881047 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

19 0.30879778 127 nips-2004-Neighbourhood Components Analysis

20 0.30413607 68 nips-2004-Face Detection --- Efficient and Rank Deficient

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.074), (15, 0.091), (26, 0.044), (31, 0.021), (33, 0.129), (35, 0.035), (39, 0.019), (50, 0.029), (51, 0.479)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78227299 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's

Author: Jaety Edwards, Yee W. Teh, Roger Bock, Michael Maire, Grace Vesom, David A. Forsyth

2 0.5982495 82 nips-2004-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-bianchi, Claudio Gentile, Andrea Tironi, Luca Zaniboni

Abstract: We study the problem of hierarchical classiﬁcation when labels corresponding to partial and/or multiple paths in the underlying taxonomy are allowed. We introduce a new hierarchical loss function, the H-loss, implementing the simple intuition that additional mistakes in the subtree of a mistaken class should not be charged for. Based on a probabilistic data model introduced in earlier work, we derive the Bayes-optimal classiﬁer for the H-loss. We then empirically compare two incremental approximations of the Bayes-optimal classiﬁer with a ﬂat SVM classiﬁer and with classiﬁers obtained by using hierarchical versions of the Perceptron and SVM algorithms. The experiments show that our simplest incremental approximation of the Bayes-optimal classiﬁer performs, after just one training epoch, nearly as well as the hierarchical SVM classiﬁer (which performs best). For the same incremental algorithm we also derive an H-loss bound showing, when data are generated by our probabilistic data model, exponentially fast convergence to the H-loss of the hierarchical classiﬁer based on the true model parameters. 1 Introduction and basic deﬁnitions We study the problem of classifying data in a given taxonomy of labels, where the taxonomy is speciﬁed as a tree forest. We assume that every data instance is labelled with a (possibly empty) set of class labels called multilabel, with the only requirement that multilabels including some node i in the taxonony must also include all ancestors of i. Thus, each multilabel corresponds to the union of one or more paths in the forest, where each path must start from a root but it can terminate on an internal node (rather than a leaf). Learning algorithms for hierarchical classiﬁcation have been investigated in, e.g., [8, 9, 10, 11, 12, 14, 15, 17, 20]. However, the scenario where labelling includes multiple and partial paths has received very little attention. The analysis in [5], which is mainly theoretical, shows in the multiple and partial path case a 0/1-loss bound for a hierarchical learning algorithm based on regularized least-squares estimates. In this work we extend [5] in several ways. First, we introduce a new hierarchical loss function, the H-loss, which is better suited than the 0/1-loss to analyze hierarchical classiﬁcation tasks, and we derive the corresponding Bayes-optimal classiﬁer under the parametric data model introduced in [5]. Second, considering various loss functions, including the H-loss, we empirically compare the performance of the following three incremental kernel-based ∗ This work was supported in part by the PASCAL Network of Excellence under EC grant no. 506778. This publication only reﬂects the authors’ views. algorithms: 1) a hierarchical version of the classical Perceptron algorithm [16]; 2) an approximation to the Bayes-optimal classiﬁer; 3) a simpliﬁed variant of this approximation. Finally, we show that, assuming data are indeed generated according to the parametric model mentioned before, the H-loss of the algorithm in 3) converges to the H-loss of the classiﬁer based on the true model parameters. Our incremental algorithms are based on training linear-threshold classiﬁers in each node of the taxonomy. A similar approach has been studied in [8], though their model does not consider multiple-path classiﬁcations as we do. Incremental algorithms are the main focus of this research, since we strongly believe that they are a key tool for coping with tasks where large quantities of data items are generated and the classiﬁcation system needs to be frequently adjusted to keep up with new items. However, we found it useful to provide a reference point for our empirical results. Thus we have also included in our experiments the results achieved by nonincremental algorithms. In particular, we have chosen a ﬂat and a hierarchical version of SVM [21, 7, 19], which are known to perform well on the textual datasets considered here. We assume data elements are encoded as real vectors x ∈ Rd which we call instances. A multilabel for an instance x is any subset of the set {1, . . . , N } of all labels/classes, including the empty set. We denote the multilabel associated with x by a vector y = (y1 , . . . , yN ) ∈ {0, 1}N , where i belongs to the multilabel of x if and only if yi = 1. A taxonomy G is a forest whose trees are deﬁned over the set of labels. A multilabel y ∈ {0, 1}N is said to respect a taxonomy G if and only if y is the union of one or more paths in G, where each path starts from a root but need not terminate on a leaf. See Figure 1. We assume the data-generating mechanism produces examples (x, y) such that y respects some ﬁxed underlying taxonomy G with N nodes. The set of roots in G is denoted by root(G). We use par(i) to denote the unique parent of node i, anc(i) to denote the set of ancestors of i, and sub(i) to denote the set of nodes in the subtree rooted at i (including i). Finally, given a predicate φ over a set Ω, we will use {φ} to denote both the subset of Ω where φ is true and the indicator function of this subset. 2 The H-loss Though several hierarchical losses have been proposed in the literature (e.g., in [11, 20]), no one has emerged as a standard yet. Since hierarchical losses are deﬁned over multilabels, we start by considering two very simple functions measuring the discrepancy between multilabels y = (y1 , ..., yN ) and y = (y1 , ..., yN ): the 0/1-loss 0/1 (y, y) = {∃i : yi = yi } and the symmetric difference loss ∆ (y, y) = {y1 = y1 } + . . . + {yN = yN }. There are several ways of making these losses depend on a given taxonomy G. In this work, we follow the intuition “if a mistake is made at node i, then further mistakes made in the subtree rooted at i are unimportant”. That is, we do not require the algorithm be able to make ﬁne-grained distinctions on tasks when it is unable to make coarse-grained ones. For example, if an algorithm failed to label a document with the class SPORTS, then the algorithm should not be charged more loss because it also failed to label the same document with the subclass SOCCER and the sub-subclass CHAMPIONS LEAGUE. A function implementing this intuition is deﬁned by N H (y, y) = i=1 ci {yi = yi ∧ yj = yj , j ∈ anc(i)}, where c1 , . . . , cN > 0 are ﬁxed cost coefﬁcients. This loss, which we call H-loss, can also be described as follows: all paths in G from a root down to a leaf are examined and, whenever we encounter a node i such that yi = yi , we add ci to the loss, whereas all the loss contributions in the subtree rooted at i are discarded. Note that if c1 = . . . = cN = 1 then 0/1 ≤ H ≤ ∆ . Choices of ci depending on the structure of G are proposed in Section 4. Given a multilabel y ∈ {0, 1}N deﬁne its G-truncation as the multilabel y = (y1 , ..., yN ) ∈ {0, 1}N where, for each i = 1, . . . , N , yi = 1 iff yi = 1 and yj = 1 for all j ∈ anc(i). Note that the G-truncation of any multilabel always respects G. A graphical (a) (b) (c) (d) Figure 1: A one-tree forest (repeated four times). Each node corresponds to a class in the taxonomy G, hence in this case N = 12. Gray nodes are included in the multilabel under consideration, white nodes are not. (a) A generic multilabel which does not respect G; (b) its G-truncation. (c) A second multilabel that respects G. (d) Superposition of multilabel (b) on multilabel (c): Only the checked nodes contribute to the H-loss between (b) and (c). representation of the notions introduced so far is given in Figure 1. In the next lemma we show that whenever y respects G, then H (y, y) cannot be smaller than H (y , y). In other words, when the multilabel y to be predicted respects a taxonomy G then there is no loss of generality in restricting to predictions which respect G. Lemma 1 Let G be a taxonomy, y, y ∈ {0, 1}N be two multilabels such that y respects G, and y be the G-truncation of y. Then H (y , y) ≤ H (y, y) . Proof. For each i = 1, . . . , N we show that yi = yi and yj = yj for all j ∈ anc(i) implies yi = yi and yj = yj for all j ∈ anc(i). Pick some i and suppose yi = yi and yj = yj for all j ∈ anc(i). Now suppose yj = 0 (and thus yj = 0) for some j ∈ anc(i). Then yi = 0 since y respects G. But this implies yi = 1, contradicting the fact that the G-truncation y respects G. Therefore, it must be the case that yj = yj = 1 for all j ∈ anc(i). Hence the G-truncation of y left each node j ∈ anc(i) unchanged, implying yj = yj for all j ∈ anc(i). But, since the G-truncation of y does not change the value of a node i whose ancestors j are such that yj = 1, this also implies yi = yi . Therefore yi = yi and the proof is concluded. 3 A probabilistic data model Our learning algorithms are based on the following statistical model for the data, originally introduced in [5]. The model deﬁnes a probability distribution fG over the set of multilabels respecting a given taxonomy G by associating with each node i of G a Bernoulli random variable Yi and deﬁning fG (y | x) = N i=1 P Yi = yi | Ypar(i) = ypar(i) , X = x . To guarantee that fG (y | x) = 0 whenever y ∈ {0, 1}N does not respect G, we set P Yi = 1 | Ypar(i) = 0, X = x = 0. Notice that this deﬁnition of fG makes the (rather simplistic) assumption that all Yk with the same parent node i (i.e., the children of i) are independent when conditioned on Yi and x. Through fG we specify an i.i.d. process {(X 1 , Y 1 ), (X 2 , Y 2 ), . . .}, where, for t = 1, 2, . . ., the multilabel Y t is distributed according to fG (· | X t ) and X t is distributed according to a ﬁxed and unknown distribution D. Each example (xt , y t ) is thus a realization of the corresponding pair (X t , Y t ) of random variables. Our parametric model for fG is described as follows. First, we assume that the support of D is the surface of the d-dimensional unit sphere (i.e., instances x ∈ R d are such that ||x|| = 1). With each node i in the taxonomy, we associate a unit-norm weight vector ui ∈ Rd . Then, we deﬁne the conditional probabilities for a nonroot node i with parent j by P (Yi = 1 | Yj = 1, X = x) = (1 + ui x)/2. If i is a root node, the previous equation simpliﬁes to P (Yi = 1 | X = x) = (1 + ui x)/2. 3.1 The Bayes-optimal classiﬁer for the H-loss We now describe a classiﬁer, called H - BAYES, that is the Bayes-optimal classiﬁer for the H-loss. In other words, H - BAYES classiﬁes any instance x with the multilabel y = argminy∈{0,1} E[ H (¯ , Y ) | x ]. Deﬁne pi (x) = P Yi = 1 | Ypar(i) = 1, X = x . y ¯ When no ambiguity arises, we write pi instead of pi (x). Now, ﬁx any unit-length instance x and let y be a multilabel that respects G. For each node i in G, recursively deﬁne H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ) + k∈child(i) H k,x (y) . The classiﬁer H - BAYES operates as follows. It starts by putting all nodes of G in a set S; nodes are then removed from S one by one. A node i can be removed only if i is a leaf or if all nodes j in the subtree rooted at i have been already removed. When i is removed, its value yi is set to 1 if and only if pi 2 − k∈child(i) H k,x (y)/ci ≥ 1 . (1) (Note that if i is a leaf then (1) is equivalent to yi = {pi ≥ 1/2}.) If yi is set to zero, then all nodes in the subtree rooted at i are set to zero. Theorem 2 For any taxonomy G and all unit-length x ∈ Rd , the multilabel generated by H - BAYES is the Bayes-optimal classiﬁcation of x for the H-loss. Proof sketch. Let y be the multilabel assigned by H - BAYES and y ∗ be any multilabel minimizing the expected H-loss. Introducing the short-hand Ex [·] = E[· | x], we can write Ex H (y, Y )= N i=1 ci (pi (1 − yi ) + (1 − pi )yi ) j∈anc(i) pj {yj = 1} . Note that we can recursively decompose the expected H-loss as Ex H (y, Y )= i∈root(G) where Ex Hi (y, Y ) = ci (pi (1 − yi ) + (1 − pi )yi ) Ex Hi (y, Y ), pj {yj = 1} + j∈anc(i) Ex Hk (y, Y ) . (2) k∈child(i) ∗ Pick a node i. If i is a leaf, then the sum in the RHS of (2) disappears and yi = {pi ≥ 1/2}, ∗ which is also the minimizer of H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ), implying yi = yi . ∗ Now let i be an internal node and inductively assume yj = yj for all j ∈ sub(i). Notice ∗ that the factors j∈anc(i) pj {yj = 1} occur in both terms in the RHS of (2). Hence yi does not depend on these factors and we can equivalently minimize ci (pi (1 − yi ) + (1 − pi )yi ) + pi {yi = 1} k∈child(i) H k,x (y), (3) where we noted that, for each k ∈ child(i), Ex Hk (y, Y ) = j∈anc(i) pj {yj = 1} pi {yi = 1}H k,x (y) . ∗ Now observe that yi minimizing (3) is equivalent to the assignment produced by H - BAYES. ∗ ∗ To conclude the proof, note that whenever yi = 0, Lemma 1 requires that yj = 0 for all nodes j ∈ sub(i), which is exactly what H - BAYES does. 4 The algorithms We consider three incremental algorithms. Each one of these algorithms learns a hierarchical classiﬁer by training a decision function gi : Rd → {0, 1} at each node i = 1, . . . , N . For a given set g1 , . . . , gN of decision functions, the hierarchical classiﬁer generated by these algorithms classiﬁes an instance x through a multilabel y = (y1 , ..., yN ) deﬁned as follows: yi = gi (x) 0 if i ∈ root(G) or yj = 1 for all j ∈ anc(i) otherwise. (4) Note that y computed this way respects G. The classiﬁers (4) are trained incrementally. Let gi,t be the decision function at node i after training on the ﬁrst t − 1 examples. When the next training example (xt , y t ) is available, the algorithms compute the multilabel y t using classiﬁer (4) based on g1,t (xt ), . . . , gN,t (xt ). Then, the algorithms consider for an update only those decision functions sitting at nodes i satisfying either i ∈ root(G) or ypar(i),t = 1. We call such nodes eligible at time t. The decision functions of all other nodes are left unchanged. The ﬁrst algorithm we consider is a simple hierarchical version of the Perceptron algorithm [16], which we call H - PERC. The decision functions at time t are deﬁned by gi,t (xt ) = {wi,t xt ≥ 0}. In the update phase, the Perceptron rule wi,t+1 = wi,t + yi,t xt is applied to every node i eligible at time t and such that yi,t = yi,t . The second algorithm, called APPROX - H - BAYES, approximates the H - BAYES classiﬁer of Section 3.1 by replacing the unknown quantities pi (xt ) with estimates (1+w i,t xt )/2. The weights w i,t are regularized least-squares estimates deﬁned by (i) wi,t = (I + Si,t−1 Si,t−1 + xt xt )−1 Si,t−1 y t−1 . (5) The columns of the matrix Si,t−1 are all past instances xs that have been stored at node i; (i) the s-th component of vector y t−1 is the i-th component yi,s of the multilabel y s associated with instance xs . In the update phase, an instance xt is stored at node i, causing an update of wi,t , whenever i is eligible at time t and |w i,t xt | ≤ (5 ln t)/Ni,t , where Ni,t is the number of instances stored at node i up to time t − 1. The corresponding decision functions gi,t are of the form gi,t (xt ) = {w i,t xt ≥ τi,t }, where the threshold τi,t ≥ 0 at node i depends on the margin values w j,t xt achieved by nodes j ∈ sub(i) — recall (1). Note that gi,t is not a linear-threshold function, as xt appears in the deﬁnition of w i,t . The margin threshold (5 ln t)/Ni,t , controlling the update of node i at time t, reduces the space requirements of the classiﬁer by keeping matrices Si,t suitably small. This threshold is motivated by the work [4] on selective sampling. The third algorithm, which we call H - RLS (Hierarchical Regularized Least Squares), is a simpliﬁed variant of APPROX - H - BAYES in which the thresholds τi,t are set to zero. That is, we have gi,t (xt ) = {w i,t xt ≥ 0} where the weights w i,t are deﬁned as in (5) and updated as in the APPROX - H - BAYES algorithm. Details on how to run APPROX - H - BAYES 2 and H - RLS in dual variables and perform an update at node i in time O(Ni,t ) are found in [3] (where a mistake-driven version of H - RLS is analyzed). 5 Experimental results The empirical evaluation of the algorithms was carried out on two well-known datasets of free-text documents. The ﬁrst dataset consists of the ﬁrst (in chronological order) 100,000 newswire stories from the Reuters Corpus Volume 1, RCV1 [2]. The associated taxonomy of labels, which are the topics of the documents, has 101 nodes organized in a forest of 4 trees. The forest is shallow: the longest path has length 3 and the the distribution of nodes, sorted by increasing path length, is {0.04, 0.53, 0.42, 0.01}. For this dataset, we used the bag-of-words vectorization performed by Xerox Research Center Europe within the EC project KerMIT (see [4] for details on preprocessing). The 100,000 documents were divided into 5 equally sized groups of chronologically consecutive documents. We then used each adjacent pair of groups as training and test set in an experiment (here the ﬁfth and ﬁrst group are considered adjacent), and then averaged the test set performance over the 5 experiments. The second dataset is a speciﬁc subtree of the OHSUMED corpus of medical abstracts [1]: the subtree rooted in “Quality of Health Care” (MeSH code N05.715). After removing overlapping classes (OHSUMED is not quite a tree but a DAG), we ended up with 94 Table 1: Experimental results on two hierarchical text classiﬁcation tasks under various loss functions. We report average test errors along with standard deviations (in parenthesis). In bold are the best performance ﬁgures among the incremental algorithms. RCV1 PERC H - PERC H - RLS AH - BAY SVM H - SVM OHSU. PERC H - PERC H - RLS AH - BAY SVM H - SVM 0/1-loss 0.702(±0.045) 0.655(±0.040) 0.456(±0.010) 0.550(±0.010) 0.482(±0.009) 0.440(±0.008) unif. H-loss 1.196(±0.127) 1.224(±0.114) 0.743(±0.026) 0.815(±0.028) 0.790(±0.023) 0.712(±0.021) norm. H-loss 0.100(±0.029) 0.099(±0.028) 0.057(±0.001) 0.090(±0.001) 0.057(±0.001) 0.055(±0.001) ∆-loss 1.695(±0.182) 1.861(±0.172) 1.086(±0.036) 1.465(±0.040) 1.173(±0.051) 1.050(±0.027) 0/1-loss 0.899(±0.024) 0.846(±0.024) 0.769(±0.004) 0.819(±0.004) 0.784(±0.003) 0.759(±0.002) unif. H-loss 1.938(±0.219) 1.560(±0.155) 1.200(±0.007) 1.197(±0.006) 1.206(±0.003) 1.170(±0.005) norm. H-loss 0.058(±0.005) 0.057(±0.005) 0.045(±0.000) 0.047(±0.000) 0.044(±0.000) 0.044(±0.000) ∆-loss 2.639(±0.226) 2.528(±0.251) 1.957(±0.011) 2.029(±0.009) 1.872(±0.005) 1.910(±0.007) classes and 55,503 documents. We made this choice based only on the structure of the subtree: the longest path has length 4, the distribution of nodes sorted by increasing path length is {0.26, 0.37, 0.22, 0.12, 0.03}, and there are a signiﬁcant number of partial and multiple path multilabels. The vectorization of the subtree was carried out as follows: after tokenization, we removed all stopwords and also those words that did not occur at least 3 times in the corpus. Then, we vectorized the documents using the Bow library [13] with a log(1 + TF) log(IDF) encoding. We ran 5 experiments by randomly splitting the corpus in a training set of 40,000 documents and a test set of 15,503 documents. Test set performances are averages over these 5 experiments. In the training set we kept more documents than in the RCV1 splits since the OHSUMED corpus turned out to be a harder classiﬁcation problem than RCV1. In both datasets instances have been normalized to unit length. We tested the hierarchical Perceptron algorithm (H - PERC), the hierarchical regularized leastsquares algorithm (H - RLS), and the approximated Bayes-optimal algorithm (APPROX - H BAYES ), all described in Section 4. The results are summarized in Table 1. APPROX - H BAYES ( AH - BAY in Table 1) was trained using cost coefﬁcients c i chosen as follows: if i ∈ root(G) then ci = |root(G)|−1 . Otherwise, ci = cj /|child(j)|, where j is the parent of i. Note that this choice of coefﬁcients amounts to splitting a unit cost equally among the roots and then splitting recursively each node’s cost equally among its children. Since, in this case, 0 ≤ H ≤ 1, we call the resulting loss normalized H-loss. We also tested a hierarchical version of SVM (denoted by H - SVM in Table 1) in which each node is an SVM classiﬁer trained using a batch version of our hierarchical learning protocol. More precisely, each node i was trained only on those examples (xt , y t ) such that ypar(i),t = 1 (note that, as no conditions are imposed on yi,t , node i is actually trained on both positive and negative examples). The resulting set of linear-threshold functions was then evaluated on the test set using the hierachical classiﬁcation scheme (4). We tried both the C and ν parametrizations [18] for SVM and found the setting C = 1 to work best for our data. 1 We ﬁnally tested the “ﬂat” variants of Perceptron and SVM, denoted by PERC and SVM. In these variants, each node is trained and evaluated independently of the others, disregarding all taxonomical information. All SVM experiments were carried out using the libSVM implementation [6]. All the tested algorithms used a linear kernel. 1 It should be emphasized that this tuning of C was actually chosen in hindsight, with no crossvalidation. As far as loss functions are concerned, we considered the 0/1-loss, the H-loss with cost coefﬁcients set to 1 (denoted by uniform H-loss), the normalized H-loss, and the symmetric difference loss (denoted by ∆-loss). Note that H - SVM performs best, but our incremental algorithms were trained for a single epoch on the training set. The good performance of SVM (the ﬂat variant of H - SVM ) is surprising. However, with a single epoch of training H - RLS does not perform worse than SVM (except on OHSUMED under the normalized H-loss) and comes reasonably close to H - SVM. On the other hand, the performance of APPROX - H - BAYES is disappointing: on OHSUMED it is the best algorithm only for the uniform H-loss, though it was trained using the normalized H-loss; on RCV1 it never outperforms H - RLS, though it always does better than PERC and H - PERC. A possible explanation for this behavior is that APPROX - H - BAYES is very sensitive to errors in the estimates of pi (x) (recall Section 3.1). Indeed, the least-squares estimates (5), which we used to approximate H - BAYES, seem to work better in practice on simpler (and possibly more robust) algorithms, such as H - RLS. The lower values of normalized H-loss on OHSUMED (a harder corpus than RCV1) can be explained because a quarter of the 94 nodes in the OHSUMED taxonomy are roots, and thus each top-level mistake is only charged about 4/94. As a ﬁnal remark, we observe that the normalized H-loss gave too small a range of values to afford ﬁne comparisons among the best performing algorithms. 6 Regret bounds for the H-loss In this section we prove a theoretical bound on the H-loss of a slight variant of the algorithm H - RLS tested in Section 5. More precisely, we assume data are generated according to the probabilistic model introduced in Section 3 with unknown instance distribution D and unknown coefﬁcients u1 , . . . , uN . We deﬁne the regret of a classiﬁer assigning label y to instance X as E H (y, Y t ) − E H (y, Y ), where the expected value is with respect the random draw of (X, Y ) and y is the multilabel assigned by classiﬁer (4) when the decision functions gi are zero-threshold functions of the form gi (x) = {ui x ≥ 0}. The theorem below shows that the regret of the classiﬁer learned by a variant of H - RLS after t training examples, with t large enough, is exponentially small in t. In other words, H - RLS learns to classify as well as the algorithm that is given the true parameters u1 , . . . , uN of the underlying data-generating process. We have been able to prove the theorem only for the variant of H - RLS storing all instances at each node. That is, every eligible node at time t is updated, irrespective of whether |w i,t xt | ≤ (5 ln t)/Ni,t . Given the i.i.d. data-generating process (X 1 , Y 1 ), (X 2 , Y 2 ), . . ., for each node k we deﬁne the derived process X k1 , X k2 , . . . including all and only the instances X s of the original process that satisfy Ypar(k),s = 1. We call this derived process the process at node k. Note that, for each k, the process at node k is an i.i.d. process. However, its distribution might depend on k. The spectrum of the process at node k is the set of eigenvalues of the correlation matrix with entries E[Xk1 ,i Xk1 ,j ] for i, j = 1, . . . , d. We have the following theorem, whose proof is omitted due to space limitations. Theorem 3 Let G be a taxonomy with N nodes and let fG be a joint density for G parametrized by N unit-norm vectors u1 , . . . , uN ∈ Rd . Assume the instance distribution is such that there exist γ1 , . . . , γN > 0 satisfying P |ui X t | ≥ γi = 1 for i = 1, . . . , N . Then, for all t > max maxi=1,...,N E H (y t , Y t ) −E 16 µ i λ i γi , maxi=1,...,N 192d µi λ 2 i the regret H (y t , Y t ) of the modiﬁed H - RLS algorithm is at most N 2 2 µi t e−κ1 γi λi t µi + t2 e−κ2 λi t µi cj , i=1 j∈sub(i) where κ1 , κ2 are constants, µi = E j∈anc(i) (1 + uj X)/2 eigenvalue in the spectrum of the process at node i. and λi is the smallest 7 Conclusions and open problems In this work we have studied the problem of hierarchical classiﬁcation of data instances in the presence of partial and multiple path labellings. We have introduced a new hierarchical loss function, the H-loss, derived the corresponding Bayes-optimal classiﬁer, and empirically compared an incremental approximation to this classiﬁer with some other incremental and nonincremental algorithms. Finally, we have derived a theoretical guarantee on the H-loss of a simpliﬁed variant of the approximated Bayes-optimal algorithm. Our investigation leaves several open issues. The current approximation to the Bayesoptimal classiﬁer is not satisfying, and this could be due to a bad choice of the model, of the estimators, of the datasets, or of a combination of them. Also, the normalized H-loss is not fully satisfying, since the resulting values are often too small. From the theoretical viewpoint, we would like to analyze the regret of our algorithms with respect to the Bayesoptimal classiﬁer, rather than with respect to a classiﬁer that makes a suboptimal use of the true model parameters. References [1] The OHSUMED test collection. URL: medir.ohsu.edu/pub/ohsumed/. [2] Reuters corpus volume 1. URL: about.reuters.com/researchandstandards/corpus/. [3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron algorithm. In Proc. 15th COLT, pages 121–137. Springer, 2002. [4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classiﬁers via selective sampling. In Proc. 16th COLT, pages 373–386. Springer, 2003. [5] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Regret bounds for hierarchical classiﬁcation with linear-threshold functions. In Proc. 17th COLT. Springer, 2004. To appear. [6] C.-C. Chang and C.-J. Lin. Libsvm — a library for support vector machines. URL: www.csie.ntu.edu.tw/∼cjlin/libsvm/. [7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2001. [8] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classiﬁcation. In Proc. 21st ICML. Omnipress, 2004. [9] S.T. Dumais and H. Chen. Hierarchical classiﬁcation of web content. In Proc. 23rd ACM Int. Conf. RDIR, pages 256–263. ACM Press, 2000. [10] M. Granitzer. Hierarchical Text Classiﬁcation using Methods from Machine Learning. PhD thesis, Graz University of Technology, 2003. [11] T. Hofmann, L. Cai, and M. Ciaramita. Learning with taxonomies: Classifying documents and words. In NIPS Workshop on Syntax, Semantics, and Statistics, 2003. [12] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proc. 14th ICML, Morgan Kaufmann, 1997. [13] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classiﬁcation and clustering. URL: www-2.cs.cmu.edu/∼mccallum/bow/. [14] A.K. McCallum, R. Rosenfeld, T.M. Mitchell, and A.Y. Ng. Improving text classiﬁcation by shrinkage in a hierarchy of classes. In Proc. 15th ICML. Morgan Kaufmann, 1998. [15] D. Mladenic. Turning yahoo into an automatic web-page classiﬁer. In Proceedings of the 13th European Conference on Artiﬁcial Intelligence, pages 473–474, 1998. [16] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Review, 65:386–408, 1958. [17] M.E. Ruiz and P. Srinivasan. Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87–118, 2002. [18] B. Sch¨ lkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. o Neural Computation, 12:1207–1245, 2000. [19] B. Sch¨ lkopf and A. Smola. Learning with kernels. MIT Press, 2002. o [20] A. Sun and E.-P. Lim. Hierarchical text classiﬁcation and evaluation. In Proc. 2001 Int. Conf. Data Mining, pages 521–528. IEEE Press, 2001. [21] V.N. Vapnik. Statistical Learning Theory. Wiley, 1998.

3 0.59536719 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM

Author: Hans P. Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, Vladimir Vapnik

Abstract: We describe an algorithm for support vector machines (SVM) that can be parallelized efficiently and scales to very large problems with hundreds of thousands of training vectors. Instead of analyzing the whole training set in one optimization step, the data are split into subsets and optimized separately with multiple SVMs. The partial results are combined and filtered again in a ‘Cascade’ of SVMs, until the global optimum is reached. The Cascade SVM can be spread over multiple processors with minimal communication overhead and requires far less memory, since the kernel matrices are much smaller than for a regular SVM. Convergence to the global optimum is guaranteed with multiple passes through the Cascade, but already a single pass provides good generalization. A single pass is 5x – 10x faster than a regular SVM for problems of 100,000 vectors when implemented on a single processor. Parallel implementations on a cluster of 16 processors were tested with over 1 million vectors (2-class problems), converging in a day or two, while a regular SVM never converged in over a week. 1

4 0.39178097 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efﬁciently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical ﬁndings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classiﬁcation, while observing in practice substantially fewer labels. 1

5 0.37739405 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes

Author: Yoshitatsu Matsuda, Kazunori Yamaguchi

Abstract: In this paper, linear multilayer ICA (LMICA) is proposed for extracting independent components from quite high-dimensional observed signals such as large-size natural scenes. There are two phases in each layer of LMICA. One is the mapping phase, where a one-dimensional mapping is formed by a stochastic gradient algorithm which makes more highlycorrelated (non-independent) signals be nearer incrementally. Another is the local-ICA phase, where each neighbor (namely, highly-correlated) pair of signals in the mapping is separated by the MaxKurt algorithm. Because LMICA separates only the highly-correlated pairs instead of all ones, it can extract independent components quite efﬁciently from appropriate observed signals. In addition, it is proved that LMICA always converges. Some numerical experiments verify that LMICA is quite efﬁcient and effective in large-size natural image processing.

6 0.37196499 73 nips-2004-Generative Affine Localisation and Tracking

7 0.3680951 194 nips-2004-Theory of localized synfire chain: characteristic propagation speed of stable spike pattern

8 0.35941699 131 nips-2004-Non-Local Manifold Tangent Learning

9 0.3584401 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

10 0.35533407 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

11 0.35467488 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

12 0.35451549 102 nips-2004-Learning first-order Markov models for control

13 0.35424986 163 nips-2004-Semi-parametric Exponential Family PCA

14 0.35408008 69 nips-2004-Fast Rates to Bayes for Kernel Machines

15 0.35380962 127 nips-2004-Neighbourhood Components Analysis

16 0.35374272 178 nips-2004-Support Vector Classification with Input Data Uncertainty

17 0.3533099 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

18 0.35315827 70 nips-2004-Following Curved Regularized Optimization Solution Paths

19 0.35268062 151 nips-2004-Rate- and Phase-coded Autoassociative Memory

20 0.35248408 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images