Automatic Digital Document Processing and Management - Problems, Algorithms and Techniques

by: Stefano Ferilli

Springer-Verlag, 2011

ISBN: 9780857291981, 297 pages

Format: PDF, OL

Copy protection: Watermark

Windows PC, Mac OS X; for all DRM-capable e-readers, Apple iPad, Android tablet PCs. Online reading for: Windows PC, Mac OS X, Linux

Price: 96.29 EUR


More about the contents


Foreword

Preface

Acknowledgments

Contents

Acronyms

Digital Documents

Documents

A Juridic Perspective

History and Trends

Current Landscape

Types of Documents

Document-Based Environments

Document Processing Needs

References

Digital Formats

Compression Techniques

RLE (Run Length Encoding)

Huffman Encoding

LZ77 and LZ78 (Lempel-Ziv)

LZW (Lempel-Ziv-Welch)

DEFLATE

Non-structured Formats

Plain Text

ASCII

ISO Latin

UNICODE

UTF

Images

Color Spaces

RGB

YUV/YCbCr

CMY(K)

HSV/HSB and HLS

Comparison among Color Spaces

Raster Graphics

BMP (BitMaP)

GIF (Graphics Interchange Format)

TIFF (Tagged Image File Format)

JPEG (Joint Photographic Experts Group)

PNG (Portable Network Graphics)

DjVu (DejaVu)

Vector Graphic

SVG (Scalable Vector Graphic)

Layout-Based Formats

PS (PostScript)

PDF (Portable Document Format)

Content-Oriented Formats

Tag-Based Formats

HTML (HyperText Markup Language)

XML (eXtensible Markup Language)

Office Formats

ODF (OpenDocument Format)

References

Legal and Security Aspects

Cryptography

Basics

Short History

Digital Cryptography

DES (Data Encryption Standard)

IDEA (International Data Encryption Algorithm)

Key Exchange Method

RSA (Rivest, Shamir, Adleman)

DSA (Digital Signature Algorithm)

Message Fingerprint

SHA (Secure Hash Algorithm)

Digital Signature

Management

DSS (Digital Signature Standard)

OpenPGP Standard

Trusting and Certificates

Legal Aspects

A Law Approach

Public Administration Initiatives

Digital Signature

Certified e-mail

Electronic Identity Card & National Services Card

Telematic Civil Proceedings

References

Document Analysis

Image Processing

Basics

Convolution and Correlation

Color Representation

Color Space Conversions

RGB-YUV

RGB-YCbCr

RGB-CMY(K)

RGB-HSV

RGB-HLS

Colorimetric Color Spaces

XYZ

L*a*b*

Color Depth Reduction

Desaturation

Grayscale (Luminance)

Black&White (Binarization)

Otsu Thresholding

Content Processing

Geometrical Transformations

Edge Enhancement

Derivative Filters

Connectivity

Flood Filling

Border Following

Dilation and Erosion

Opening and Closing

Edge Detection

Canny

Hough Transform

Polygonal Approximation

Snakes

References

Document Image Analysis

Document Structures

Spatial Description

4-Intersection Model

Minimum Bounding Rectangles

Logical Structure Description

DOM (Document Object Model)

Pre-processing for Digitized Documents

Document Image Defect Models

Deskewing

Dewarping

Segmentation-Based Dewarping

Content Identification

Optical Character Recognition

Tesseract

JTOCR

Segmentation

Classification of Segmentation Techniques

Pixel-Based Segmentation

RLSA (Run Length Smoothing Algorithm)

RLSO (Run-Length Smoothing with OR)

X-Y Trees

Block-Based Segmentation

The DOCSTRUM

The CLiDE (Chemical Literature Data Extraction) Approach

Background Analysis

RLSO on Born-Digital Documents

Document Image Understanding

Relational Approach

INTHELEX (INcremental THEory Learner from EXamples)

Description

DCMI (Dublin Core Metadata Initiative)

References

Content Processing

Natural Language Processing

Resources-Lexical Taxonomies

WordNet

WordNet Domains

Senso Comune

Tools

Tokenization

Language Recognition

Stopword Removal

Stemming

Suffix Stripping

Part-of-Speech Tagging

Rule-Based Approach

Word Sense Disambiguation

Lesk's Algorithm

Yarowsky's Algorithm

Parsing

Link Grammar

References

Information Management

Information Retrieval

Performance Evaluation

Indexing Techniques

Vector Space Model

Query Evaluation

Relevance Feedback

Dimensionality Reduction

Latent Semantic Analysis and Indexing

Concept Indexing

Image Retrieval

Keyword Extraction

TF-ITP

Naive Bayes

Co-occurrence

Text Categorization

A Semantic Approach Based on WordNet Domains

Information Extraction

WHISK

A Multistrategy Approach

The Semantic Web

References

Appendix A A Case Study: DOMINUS

General Framework

Actors and Workflow

Architecture

Functionality

Input Document Normalization

Layout Analysis

Kernel-Based Basic Blocks Grouping

Document Image Understanding

Categorization, Filing and Indexing

Prototype Implementation

Exploitation for Scientific Conference Management

GRAPE

Appendix B Machine Learning Notions

Categorization of Techniques

Noteworthy Techniques

Artificial Neural Networks

Decision Trees

k-Nearest Neighbor

Inductive Logic Programming

Naive Bayes

Hidden Markov Models

Clustering

Experimental Strategies

k-Fold Cross-Validation

Leave-One-Out

Random Split

Glossary

Bounding box

Byte ordering

Ceiling function

Chunk

Connected component

Heaviside unit function

Heterarchy

KL-divergence

Linear regression

Run

Scanline

References

Index