Automatic Digital Document Processing and Management - Problems, Algorithms and Techniques
Foreword
Preface
Acknowledgments
Contents
Acronyms
Digital Documents
Documents
A Juridic Perspective
History and Trends
Current Landscape
Types of Documents
Document-Based Environments
Document Processing Needs
References
Digital Formats
Compression Techniques
RLE (Run Length Encoding)
Huffman Encoding
LZ77 and LZ78 (Lempel-Ziv)
LZW (Lempel-Ziv-Welch)
DEFLATE
Non-structured Formats
Plain Text
ASCII
ISO Latin
UNICODE
UTF
Images
Color Spaces
RGB
YUV/YCbCr
CMY(K)
HSV/HSB and HLS
Comparison among Color Spaces
Raster Graphics
BMP (BitMaP)
GIF (Graphics Interchange Format)
TIFF (Tagged Image File Format)
JPEG (Joint Photographic Experts Group)
PNG (Portable Network Graphics)
DjVu (DejaVu)
Vector Graphic
SVG (Scalable Vector Graphic)
Layout-Based Formats
PS (PostScript)
PDF (Portable Document Format)
Content-Oriented Formats
Tag-Based Formats
HTML (HyperText Markup Language)
XML (eXtensible Markup Language)
Office Formats
ODF (OpenDocument Format)
References
Legal and Security Aspects
Cryptography
Basics
Short History
Digital Cryptography
DES (Data Encryption Standard)
IDEA (International Data Encryption Algorithm)
Key Exchange Method
RSA (Rivest, Shamir, Adleman)
DSA (Digital Signature Algorithm)
Message Fingerprint
SHA (Secure Hash Algorithm)
Digital Signature
Management
DSS (Digital Signature Standard)
OpenPGP Standard
Trusting and Certificates
Legal Aspects
A Law Approach
Public Administration Initiatives
Digital Signature
Certified e-mail
Electronic Identity Card & National Services Card
Telematic Civil Proceedings
References
Document Analysis
Image Processing
Basics
Convolution and Correlation
Color Representation
Color Space Conversions
RGB-YUV
RGB-YCbCr
RGB-CMY(K)
RGB-HSV
RGB-HLS
Colorimetric Color Spaces
XYZ
L*a*b*
Color Depth Reduction
Desaturation
Grayscale (Luminance)
Black&White (Binarization)
Otsu Thresholding
Content Processing
Geometrical Transformations
Edge Enhancement
Derivative Filters
Connectivity
Flood Filling
Border Following
Dilation and Erosion
Opening and Closing
Edge Detection
Canny
Hough Transform
Polygonal Approximation
Snakes
References
Document Image Analysis
Document Structures
Spatial Description
4-Intersection Model
Minimum Bounding Rectangles
Logical Structure Description
DOM (Document Object Model)
Pre-processing for Digitized Documents
Document Image Defect Models
Deskewing
Dewarping
Segmentation-Based Dewarping
Content Identification
Optical Character Recognition
Tesseract
JTOCR
Segmentation
Classification of Segmentation Techniques
Pixel-Based Segmentation
RLSA (Run Length Smoothing Algorithm)
RLSO (Run-Length Smoothing with OR)
X-Y Trees
Block-Based Segmentation
The DOCSTRUM
The CLiDE (Chemical Literature Data Extraction) Approach
Background Analysis
RLSO on Born-Digital Documents
Document Image Understanding
Relational Approach
INTHELEX (INcremental THEory Learner from EXamples)
Description
DCMI (Dublin Core Metadata Initiative)
References
Content Processing
Natural Language Processing
Resources-Lexical Taxonomies
WordNet
WordNet Domains
Senso Comune
Tools
Tokenization
Language Recognition
Stopword Removal
Stemming
Suffix Stripping
Part-of-Speech Tagging
Rule-Based Approach
Word Sense Disambiguation
Lesk's Algorithm
Yarowsky's Algorithm
Parsing
Link Grammar
References
Information Management
Information Retrieval
Performance Evaluation
Indexing Techniques
Vector Space Model
Query Evaluation
Relevance Feedback
Dimensionality Reduction
Latent Semantic Analysis and Indexing
Concept Indexing
Image Retrieval
Keyword Extraction
TF-ITP
Naive Bayes
Co-occurrence
Text Categorization
A Semantic Approach Based on WordNet Domains
Information Extraction
WHISK
A Multistrategy Approach
The Semantic Web
References
Appendix A A Case Study: DOMINUS
General Framework
Actors and Workflow
Architecture
Functionality
Input Document Normalization
Layout Analysis
Kernel-Based Basic Blocks Grouping
Document Image Understanding
Categorization, Filing and Indexing
Prototype Implementation
Exploitation for Scientific Conference Management
GRAPE
Appendix B Machine Learning Notions
Categorization of Techniques
Noteworthy Techniques
Artificial Neural Networks
Decision Trees
k-Nearest Neighbor
Inductive Logic Programming
Naive Bayes
Hidden Markov Models
Clustering
Experimental Strategies
k-Fold Cross-Validation
Leave-One-Out
Random Split
Glossary
Bounding box
Byte ordering
Ceiling function
Chunk
Connected component
Heaviside unit function
Heterarchy
KL-divergence
Linear regression
Run
Scanline
References
Index