
Document classification: Difference between revisions

From Wikipedia, the free encyclopedia
m En dash fix (via WP:JWB)
 
(40 intermediate revisions by 26 users not shown)
Line 1: Line 1:
{{Short description|Process of categorizing documents}}

'''Document classification''' or '''document categorization''' is a problem in [[library science]], [[information science]] and [[computer science]]. The task is to assign a [[document]] to one or more [[Class (philosophy)|classes]] or [[Categorization|categories]]. This may be done "manually" (or "intellectually") or [[algorithmically]]. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.
'''Document classification''' or '''document categorization''' is a problem in [[library science]], [[information science]] and [[computer science]]. The task is to assign a [[document]] to one or more [[Class (philosophy)|classes]] or [[Categorization|categories]]. This may be done "manually" (or "intellectually") or [[algorithmically]]. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.


The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied.
The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, '''text classification''' is implied.


Documents may be classified according to their [[Subject (documents)|subjects]] or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach.
Documents may be classified according to their [[Subject (documents)|subjects]] or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach.
Line 8: Line 10:
'''Content-based classification''' is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a common rule for classification in libraries, that at least 20% of the content of a book should be about the class to which the book is assigned.<ref>Library of Congress (2008). The subject headings manual. Washington, DC.: Library of Congress, Policy and Standards Division. (Sheet H 180: "Assign headings only for topics that comprise at least 20% of the work.")</ref> In automatic classification it could be the number of times given words appears in a document.
'''Content-based classification''' is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a common rule for classification in libraries, that at least 20% of the content of a book should be about the class to which the book is assigned.<ref>Library of Congress (2008). The subject headings manual. Washington, DC.: Library of Congress, Policy and Standards Division. (Sheet H 180: "Assign headings only for topics that comprise at least 20% of the work.")</ref> In automatic classification it could be the number of times given words appears in a document.


'''Request-oriented classification''' (or -indexing) is classification in which the anticipated request from users is influencing how documents are being classified. The classifier asks themself: “Under which descriptors should this entity be found?” and “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p.&nbsp;230<ref>Soergel, Dagobert (1985). [https://books.google.com/books?id=cHbNCgAAQBAJ&printsec=frontcover#v=onepage&q&f=false Organizing information: Principles of data base and retrieval systems]. Orlando, FL: Academic Press.</ref>).
'''Request-oriented classification''' (or -indexing) is classification in which the anticipated request from users is influencing how documents are being classified. The classifier asks themself: “Under which descriptors should this entity be found?” and “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p.&nbsp;230<ref>Soergel, Dagobert (1985). [https://books.google.com/books?id=cHbNCgAAQBAJ Organizing information: Principles of data base and retrieval systems]. Orlando, FL: Academic Press.</ref>).


Request-oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents differently when compared to a historical library. It is probably better, however, to understand request-oriented classification as ''policy-based classification'': The classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request-oriented classification be regarded as a user-based approach.
Request-oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents differently when compared to a historical library. It is probably better, however, to understand request-oriented classification as ''policy-based classification'': The classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request-oriented classification be regarded as a user-based approach.


==Classification versus indexing==
==Classification versus indexing==
Sometimes a distinction is made between assigning documents to classes ("classification") versus assigning [[Subject (documents)|subjects]] to documents ("[[subject indexing]]") but as [[Frederick Wilfrid Lancaster]] has argued, this distinction is not fruitful. "These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p.&nbsp;21<ref>Lancaster, F. W. (2003). Indexing and abstracting in theory and practice. Library Association, London.</ref>). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a [[thesaurus]] and vice versa (cf., Aitchison, 1986,<ref>Aitchison, J. (1986). "A classification as a source for thesaurus: The Bibliographic Classification of H. E. Bliss as a source of thesaurus terms and structure." Journal of Documentation, Vol. 42 No. 3, pp. 160-181.</ref> 2004;<ref>Aitchison, J. (2004). "Thesauri from BC2: Problems and possibilities revealed in an experimental thesaurus derived from the Bliss Music schedule." Bliss Classification Bulletin, Vol. 46, pp. 20-26.</ref> Broughton, 2008;<ref>Broughton, V. (2008). "[https://link.springer.com/article/10.1007/s10516-007-9027-7 A faceted classification as the basis of a faceted terminology: Conversion of a classified structure to thesaurus format in the Bliss Bibliographic Classification] (2nd Ed.).]" Axiomathes, Vol. 18 No.2, pp. 193-210.</ref> Riesthuis & Bliedung, 1991<ref>Riesthuis, G. J. A., & Bliedung, St. (1991). "Thesaurification of the UDC." Tools for knowledge organization and the human interface, Vol. 2, pp. 109-117. Index Verlag, Frankfurt.</ref>). Therefore, is the act of labeling a document (say by assigning a term from a [[controlled vocabulary]] to a document) at the same time to assign that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents).
Sometimes a distinction is made between assigning documents to classes ("classification") versus assigning [[Subject (documents)|subjects]] to documents ("[[subject indexing]]") but as [[Frederick Wilfrid Lancaster]] has argued, this distinction is not fruitful. "These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p.&nbsp;21<ref>Lancaster, F. W. (2003). Indexing and abstracting in theory and practice. Library Association, London.</ref>). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a [[thesaurus]] and vice versa (cf., Aitchison, 1986,<ref>Aitchison, J. (1986). "A classification as a source for thesaurus: The Bibliographic Classification of H. E. Bliss as a source of thesaurus terms and structure." Journal of Documentation, Vol. 42 No. 3, pp. 160-181.</ref> 2004;<ref>Aitchison, J. (2004). "Thesauri from BC2: Problems and possibilities revealed in an experimental thesaurus derived from the Bliss Music schedule." Bliss Classification Bulletin, Vol. 46, pp. 20-26.</ref> Broughton, 2008;<ref>Broughton, V. (2008). "[https://link.springer.com/article/10.1007/s10516-007-9027-7 A faceted classification as the basis of a faceted terminology: Conversion of a classified structure to thesaurus format in the Bliss Bibliographic Classification] (2nd Ed.).]" Axiomathes, Vol. 18 No.2, pp. 193-210.</ref> Riesthuis & Bliedung, 1991<ref>Riesthuis, G. J. A., & Bliedung, St. (1991). "Thesaurification of the UDC." Tools for knowledge organization and the human interface, Vol. 2, pp. 109-117. Index Verlag, Frankfurt.</ref>). Therefore, the act of labeling a document (say by assigning a term from a [[controlled vocabulary]] to a document) is at the same time to assign that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents). 
In other words, labeling a document is the same as assigning it to the class of documents indexed under that label.


==Automatic document classification (ADC)==
==Automatic document classification (ADC)==
Line 19: Line 21:
Rossi, R. G., Lopes, A. d. A., and Rezende, S. O. (2016). [https://www.sciencedirect.com/science/article/pii/S0306457315000990 Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts].
Rossi, R. G., Lopes, A. d. A., and Rezende, S. O. (2016). [https://www.sciencedirect.com/science/article/pii/S0306457315000990 Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts].
Information Processing & Management, 52(2):217–257.
Information Processing & Management, 52(2):217–257.
</ref> where parts of the documents are labeled by the external mechanism. There are several software products under various license models available.<ref>[https://www.paralleldots.com/text-analysis-apis ParallelDots Document Classification APIs]</ref><ref>{{Cite web |url=https://pdfs.semanticscholar.org/bea4/a204239556a29228decc9e029c326e4900b7.pdf |title=An Interactive Automatic Document Classification Prototype |access-date=2017-11-14 |archive-url=https://web.archive.org/web/20171115082749/https://pdfs.semanticscholar.org/bea4/a204239556a29228decc9e029c326e4900b7.pdf |archive-date=2017-11-15 |url-status=dead }}</ref><ref>[https://seer.lcc.ufmg.br/index.php/jidm/article/download/43/41An Interactive Automatic Document Classification Prototype] {{webarchive |url=https://web.archive.org/web/20150424122349/https://seer.lcc.ufmg.br/index.php/jidm/article/download/43/41An |date=April 24, 2015 }}</ref><ref>[https://archive.is/20141208063727/http://www.artsyltech.com/da_classification.htmlAutomatic Document Classification - Artsyl]</ref><ref>[http://www.abbyy.com/ocr_sdk_windows/what_is_new/classification/ ABBYY FineReader Engine 11 for Windows]</ref><ref>[http://www.antidot.net/classifier/ Classifier - Antidot]</ref>
</ref> where parts of the documents are labeled by the external mechanism. There are several software products under various license models available.<ref>{{Cite web |url=https://pdfs.semanticscholar.org/bea4/a204239556a29228decc9e029c326e4900b7.pdf |title=An Interactive Automatic Document Classification Prototype |access-date=2017-11-14 |archive-url=https://web.archive.org/web/20171115082749/https://pdfs.semanticscholar.org/bea4/a204239556a29228decc9e029c326e4900b7.pdf |archive-date=2017-11-15 |url-status=dead }}</ref><ref>[https://seer.lcc.ufmg.br/index.php/jidm/article/download/43/41An Interactive Automatic Document Classification Prototype] {{webarchive |url=https://web.archive.org/web/20150424122349/https://seer.lcc.ufmg.br/index.php/jidm/article/download/43/41An |date=April 24, 2015 }}</ref><ref>[https://archive.today/20141208063727/http://www.artsyltech.com/da_classification.htmlAutomatic Document Classification - Artsyl]</ref><ref>[http://www.abbyy.com/ocr_sdk_windows/what_is_new/classification/ ABBYY FineReader Engine 11 for Windows]</ref><ref>[http://www.antidot.net/classifier/ Classifier - Antidot]</ref><ref>{{Cite web|title=3 Document Classification Methods for Tough Projects|url=https://www.bisok.com/grooper-data-capture-method-features/document-classification/|access-date=2021-08-04|website=www.bisok.com|language=en-US}}</ref>


=== Techniques ===
=== Techniques ===
Automatic document classification techniques include:
Automatic document classification techniques include:

* [[Artificial neural network]]
* [[Concept Mining]]
* [[Decision tree learning|Decision trees]] such as [[ID3 algorithm|ID3]] or [[C4.5 algorithm|C4.5]]
* [[Expectation maximization]] (EM)
* [[Expectation maximization]] (EM)
* [[Naive Bayes classifier]]
* [[tf–idf]]
* [[Instantaneously trained neural networks]]
* [[Instantaneously trained neural networks]]
* [[Latent semantic indexing]]
* [[Latent semantic indexing]]
* [[Multiple-instance learning]]
* [[Naive Bayes classifier]]
* [[Natural language processing]] approaches
* [[Rough set]]-based classifier
* [[Soft set]]-based classifier
* [[Support vector machines]] (SVM)
* [[Support vector machines]] (SVM)
* [[Artificial neural network]]
* [[k-nearest neighbor algorithm|K-nearest neighbour algorithms]]
* [[k-nearest neighbor algorithm|K-nearest neighbour algorithms]]
* [[tf–idf]]
* [[Decision tree learning|Decision trees]] such as [[ID3 algorithm|ID3]] or [[C4.5 algorithm|C4.5]]
* [[Concept Mining]]
* [[Rough set]]-based classifier
* [[Soft set]]-based classifier
* [[Multiple-instance learning]]
* [[Natural language processing]] approaches


== Applications ==
== Applications ==
Classification techniques have been applied to
Classification techniques have been applied to
* [[spam filter]]ing, a process which tries to discern [[E-mail spam]] messages from legitimate emails
* [[spam filter]]ing, a process which tries to discern [[E-mail spam]] messages from legitimate emails
* email [[routing]], sending an email sent to a general address to a specific address or mailbox depending on topic<ref>Stephan Busemann, Sven Schmeier and Roman G. Arens (2000). [https://arxiv.org/pdf/cs/0003060 Message classification in the call center]. In Sergei Nirenburg, Douglas Appelt, Fabio Ciravegna and Robert Dale, eds., Proc. 6th Applied Natural Language Processing Conf. (ANLP'00), pp. 158-165, ACL.</ref>
* email [[routing]], sending an email sent to a general address to a specific address or mailbox depending on topic<ref>Stephan Busemann, Sven Schmeier and Roman G. Arens (2000). [https://arxiv.org/abs/cs/0003060 Message classification in the call center]. In Sergei Nirenburg, Douglas Appelt, Fabio Ciravegna and Robert Dale, eds., Proc. 6th Applied Natural Language Processing Conf. (ANLP'00), pp. 158–165, ACL.</ref>
* [[language identification]], automatically determining the language of a text
* [[language identification]], automatically determining the language of a text
* genre classification, automatically determining the genre of a text<ref>{{Citation|last = Santini| first = Marina | last2 = Rosso| first2 = Mark| title = Testing a Genre-Enabled Application: A Preliminary Assessment| url = http://www.bcs.org/upload/pdf/ewic_fd08_paper7.pdf| series = BCS IRSG Symposium: Future Directions in Information Access| place = London, UK | pages= 54–63| year = 2008 }}</ref>
* genre classification, automatically determining the genre of a text<ref>{{Citation| last1 = Santini| first1 = Marina| last2 = Rosso| first2 = Mark| title = Testing a Genre-Enabled Application: A Preliminary Assessment| url = http://www.bcs.org/upload/pdf/ewic_fd08_paper7.pdf| series = BCS IRSG Symposium: Future Directions in Information Access| place = London, UK| pages = 54–63| year = 2008| access-date = 2011-10-21| archive-date = 2019-11-15| archive-url = https://web.archive.org/web/20191115061125/https://www.bcs.org/upload/pdf/ewic_fd08_paper7.pdf| url-status = dead}}</ref>
* [[Readability|readability assessment]], automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger [[text simplification]] system
* [[Readability|readability assessment]], automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger [[text simplification]] system
* [[sentiment analysis]], determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document.
* [[sentiment analysis]], determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document.
* health-related classification using social media in public health surveillance <ref>X. Dai, M. Bikdash and B. Meyer, "From social media to public health surveillance: Word embedding based clustering method for twitter classification," SoutheastCon 2017, Charlotte, NC, 2017, pp. 1-7.
* health-related classification using social media in public health surveillance <ref>X. Dai, M. Bikdash and B. Meyer, "From social media to public health surveillance: Word embedding based clustering method for twitter classification," SoutheastCon 2017, Charlotte, NC, 2017, pp. 1-7.
{{DOI|10.1109/SECON.2017.7925400}}</ref>
{{doi|10.1109/SECON.2017.7925400}}</ref>
* article triage, selecting articles that are relevant for manual literature curation, for example as is being done as the first step to generate manually curated annotation databases in biology.<ref name=":0">{{Cite journal
* article triage, selecting articles that are relevant for manual literature curation, for example as is being done as the first step to generate manually curated annotation databases in biology <ref name=":0">{{Cite journal
| pmid = 18834495
| pmid = 18834495
| year = 2008
| year = 2008
Line 55: Line 58:
| title = Overview of the protein-protein interaction annotation extraction task of Bio ''Creative'' II
| title = Overview of the protein-protein interaction annotation extraction task of Bio ''Creative'' II
| journal = Genome Biology
| journal = Genome Biology
| volume = 9 Suppl 2
| volume = 9
| pages = S4
| pages = S4
| last2 = Leitner
| last2 = Leitner
Line 63: Line 66:
| last4 = Valencia
| last4 = Valencia
| first4 = A
| first4 = A
| issue = Suppl 2
| doi = 10.1186/gb-2008-9-s2-s4
| doi = 10.1186/gb-2008-9-s2-s4
| pmc = 2559988
| pmc = 2559988
| doi-access = free
}}</ref>
}}</ref>


Line 74: Line 79:
* [[Concept-based image indexing]]
* [[Concept-based image indexing]]
* [[Content-based image retrieval]]
* [[Content-based image retrieval]]
* [[Decimal section numbering]]
* [[Document]]
* [[Document]]
* [[Supervised learning]], [[unsupervised learning]]
* [[Document retrieval]]
* [[Document retrieval]]
* [[Document clustering]]
* [[Document clustering]]
Line 87: Line 92:
* [[Subject (documents)]]
* [[Subject (documents)]]
* [[Subject indexing]]
* [[Subject indexing]]
* [[Supervised learning]], [[unsupervised learning]]
* [[Text mining]], [[web mining]], [[concept mining]]
* [[Text mining]], [[web mining]], [[concept mining]]
{{colend}}
{{colend}}

== Further reading ==
* Fabrizio Sebastiani. [https://arxiv.org/pdf/cs.ir/0110053 Machine learning in automated text categorization]. ACM Computing Surveys, 34(1):1–47, 2002.
* Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. [http://www.ir.uwaterloo.ca/book/ Information Retrieval: Implementing and Evaluating Search Engines]. MIT Press, 2010.


==References==
==References==
{{Reflist}}
{{Reflist}}

== Further reading ==
* Fabrizio Sebastiani. [https://arxiv.org/abs/cs.ir/0110053 Machine learning in automated text categorization]. ACM Computing Surveys, 34(1):1–47, 2002.
* Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. [http://www.ir.uwaterloo.ca/book/ Information Retrieval: Implementing and Evaluating Search Engines] {{Webarchive|url=https://web.archive.org/web/20201005195805/http://www.ir.uwaterloo.ca/book/ |date=2020-10-05 }}. MIT Press, 2010.


== External links ==
== External links ==
* [https://web.archive.org/web/20070613200617/http://isp.imm.dtu.dk/thor/projects/multimedia/textmining/node11.html Introduction to document classification]
* [https://web.archive.org/web/20070613200617/http://isp.imm.dtu.dk/thor/projects/multimedia/textmining/node11.html Introduction to document classification]
* [https://www.cs.technion.ac.il/~gabr/resources/atc/atcbib.html Bibliography on Automated Text Categorization]
* [https://www.cs.technion.ac.il/~gabr/resources/atc/atcbib.html Bibliography on Automated Text Categorization] {{Webarchive|url=https://web.archive.org/web/20190926230122/http://www.cs.technion.ac.il/~gabr/resources/atc/atcbib.html |date=2019-09-26 }}
* [http://liinwww.ira.uka.de/bibliography/Ai/query-classification.html Bibliography on Query Classification]
* [http://liinwww.ira.uka.de/bibliography/Ai/query-classification.html Bibliography on Query Classification] {{Webarchive|url=https://web.archive.org/web/20191002205729/http://liinwww.ira.uka.de/bibliography/Ai/query-classification.html |date=2019-10-02 }}
* [http://www.gabormelli.com/RKB/Text_Classification_Task Text Classification] analysis page
* [http://www.gabormelli.com/RKB/Text_Classification_Task Text Classification] analysis page
* [http://www.nltk.org/book/ch06.html Learning to Classify Text - Chap. 6 of the book Natural Language Processing with Python] (available online)
* [http://www.nltk.org/book/ch06.html Learning to Classify Text - Chap. 6 of the book Natural Language Processing with Python] (available online)
* [http://techtc.cs.technion.ac.il TechTC - Technion Repository of Text Categorization Datasets]
* [http://techtc.cs.technion.ac.il TechTC - Technion Repository of Text Categorization Datasets] {{Webarchive|url=https://web.archive.org/web/20200214072558/http://techtc.cs.technion.ac.il/ |date=2020-02-14 }}
* [http://www.daviddlewis.com/resources/testcollections/ David D. Lewis's Datasets]
* [http://www.daviddlewis.com/resources/testcollections/ David D. Lewis's Datasets]
* [http://www.biocreative.org/tasks/biocreative-iii/ppi/ BioCreative III ACT (article classification task) dataset]
* [http://www.biocreative.org/tasks/biocreative-iii/ppi/ BioCreative III ACT (article classification task) dataset]


{{Natural Language Processing}}

[[Category:Data mining]]
[[Category:Information science]]
[[Category:Information science]]
[[Category:Natural language processing]]
[[Category:Knowledge representation]]
[[Category:Knowledge representation]]
[[Category:Data mining]]
[[Category:Machine learning]]
[[Category:Machine learning]]
[[Category:Natural language processing]]

Latest revision as of 10:53, 4 May 2024

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.

The documents to be classified may be texts, images, music, etc. Each kind of document poses its own classification problems. When not otherwise specified, text classification is implied.

Documents may be classified according to their subjects or according to other attributes (such as document type, author, or year of printing). The rest of this article considers only subject classification. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach.

"Content-based" versus "request-based" classification


Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. For example, a common rule in library classification is that at least 20% of the content of a book should be about the class to which the book is assigned.[1] In automatic classification, the weight could be the number of times given words appear in a document.
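
As a rough illustration of the content-based idea, the sketch below assigns a text to every class whose keywords account for at least a given share of its words. The keyword lists, the function name `content_based_classes`, and the reuse of the 20% figure as a word-share threshold are assumptions made for this example only; the library rule cited above concerns topics in a work, not word counts.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a real system would use a far richer vocabulary.
CLASS_KEYWORDS = {
    "astronomy": {"star", "planet", "orbit", "telescope"},
    "cooking": {"recipe", "oven", "flour", "simmer"},
}

def content_based_classes(text, class_keywords=CLASS_KEYWORDS, threshold=0.20):
    """Return the classes whose keywords make up at least `threshold` of the words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    total = len(words)
    assigned = []
    for label, keywords in class_keywords.items():
        # Share of the document's words that belong to this class's keyword list.
        share = sum(counts[w] for w in keywords) / total
        if share >= threshold:
            assigned.append(label)
    return sorted(assigned)

example = "The telescope tracked the planet as its orbit passed the star"
print(content_based_classes(example))
```

Because the function returns a list, a document can fall into several classes or none, matching the "one or more classes" framing used in this article.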

Request-oriented classification (or -indexing) is classification in which the anticipated requests from users influence how documents are classified. The classifier asks, “Under which descriptors should this entity be found?” and tries to “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p. 230[2]).

Request-oriented classification may target a particular audience or user group. For example, a library or database for feminist studies may classify or index documents differently from a historical library. It is probably better, however, to understand request-oriented classification as policy-based classification: the classification is done according to some ideals and reflects the purpose of the library or database doing the classification. It is therefore not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request-oriented classification be regarded as a user-based approach.

Classification versus indexing


Sometimes a distinction is made between assigning documents to classes ("classification") and assigning subjects to documents ("subject indexing"), but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. “These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p. 21[3]). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf. Aitchison, 1986,[4] 2004;[5] Broughton, 2008;[6] Riesthuis & Bliedung, 1991[7]). Therefore, labeling a document (say, by assigning it a term from a controlled vocabulary) at the same time assigns that document to the class of documents indexed by that term: all documents indexed or classified as X belong to the same class of documents.
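
The equivalence between indexing and classification can be made concrete with a small sketch: inverting a document-to-terms mapping yields, for each term, the class of documents indexed under it. The document identifiers, the terms, and the function name `classes_from_index` are hypothetical and serve only as illustration.

```python
from collections import defaultdict

def classes_from_index(doc_terms):
    """Invert a document -> terms mapping into a term -> documents mapping."""
    classes = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            # Indexing doc_id with `term` places it in the class named `term`.
            classes[term].add(doc_id)
    return dict(classes)

# Hypothetical subject index: each document carries controlled-vocabulary terms.
index = {
    "doc1": ["Thesauri", "Classification"],
    "doc2": ["Classification"],
    "doc3": ["Thesauri"],
}
classes = classes_from_index(index)
print(sorted(classes["Classification"]))
```

Nothing is added by the inversion; the same assignments are simply read from the other direction, which is the sense in which the distinction is described above as superficial.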

Automatic document classification (ADC)


Automatic document classification tasks can be divided into three types: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents; unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information; and semi-supervised document classification,[8] where only some of the documents are labeled by the external mechanism. Several software products are available under various license models.[9][10][11][12][13][14]
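
To illustrate the supervised case, the following minimal sketch trains a multinomial naive Bayes classifier (one of the techniques this article names) on a handful of labeled texts. The training sentences, the labels, and the function names are invented for the example, and add-one smoothing is an assumed design choice rather than something the article prescribes.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(labeled_docs):
    """Collect per-class document counts and word counts from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_docs, word_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log-probability under add-one smoothing."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)  # class prior
        class_total = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][word] + 1) / (class_total + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data: the labels play the role of the external mechanism.
training = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the agenda", "ham"),
]
model = train_naive_bayes(training)
print(classify("claim your free prize", model))
```

In the unsupervised setting no such labels would be available, and in the semi-supervised setting only some of the training texts would carry one.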

Techniques


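Automatic document classification techniques include:

  • Artificial neural network
  • Concept Mining
  • Decision trees such as ID3 or C4.5
  • Expectation maximization (EM)
  • Instantaneously trained neural networks
  • Latent semantic indexing
  • Multiple-instance learning
  • Naive Bayes classifier
  • Natural language processing approaches
  • Rough set-based classifier
  • Soft set-based classifier
  • Support vector machines (SVM)
  • K-nearest neighbour algorithms
  • tf–idf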
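
Among the techniques this article names, tf–idf is a term-weighting scheme whose output is typically passed on to a classifier rather than a classifier in its own right. The sketch below computes one common variant, raw term frequency multiplied by log(N / document frequency); the sample sentences and the function name `tf_idf` are assumptions for illustration, and other weighting variants exist.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using raw tf and log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

docs = ["the cat sat", "the dog barked", "the cat purred"]
w = tf_idf(docs)
for doc, vector in zip(docs, w):
    print(doc, vector)
```

A word that occurs in every document, such as "the" here, receives weight zero, so the weights emphasise the words that set documents apart.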

Applications


Classification techniques have been applied to:

  • spam filtering, a process which tries to distinguish email spam from legitimate emails
  • email routing, forwarding an email sent to a general address to a specific address or mailbox depending on its topic[15]
  • language identification, automatically determining the language of a text
  • genre classification, automatically determining the genre of a text[16]
  • readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger text simplification system
  • sentiment analysis, determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document
  • health-related classification using social media in public health surveillance[17]
  • article triage, selecting articles that are relevant for manual literature curation, for example as the first step in generating manually curated annotation databases in biology[18]

See also


References

  1. ^ Library of Congress (2008). The subject headings manual. Washington, DC.: Library of Congress, Policy and Standards Division. (Sheet H 180: "Assign headings only for topics that comprise at least 20% of the work.")
  2. ^ Soergel, Dagobert (1985). Organizing information: Principles of data base and retrieval systems. Orlando, FL: Academic Press.
  3. ^ Lancaster, F. W. (2003). Indexing and abstracting in theory and practice. Library Association, London.
  4. ^ Aitchison, J. (1986). "A classification as a source for thesaurus: The Bibliographic Classification of H. E. Bliss as a source of thesaurus terms and structure." Journal of Documentation, Vol. 42 No. 3, pp. 160-181.
  5. ^ Aitchison, J. (2004). "Thesauri from BC2: Problems and possibilities revealed in an experimental thesaurus derived from the Bliss Music schedule." Bliss Classification Bulletin, Vol. 46, pp. 20-26.
  6. ^ Broughton, V. (2008). "A faceted classification as the basis of a faceted terminology: Conversion of a classified structure to thesaurus format in the Bliss Bibliographic Classification (2nd Ed.)." Axiomathes, Vol. 18 No. 2, pp. 193-210.
  7. ^ Riesthuis, G. J. A., & Bliedung, St. (1991). "Thesaurification of the UDC." Tools for knowledge organization and the human interface, Vol. 2, pp. 109-117. Index Verlag, Frankfurt.
  8. ^ Rossi, R. G., Lopes, A. d. A., and Rezende, S. O. (2016). Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts. Information Processing & Management, 52(2):217–257.
  9. ^ "An Interactive Automatic Document Classification Prototype" (PDF). Archived from the original (PDF) on 2017-11-15. Retrieved 2017-11-14.
  10. ^ An Interactive Automatic Document Classification Prototype Archived April 24, 2015, at the Wayback Machine
  11. ^ Automatic Document Classification - Artsyl
  12. ^ ABBYY FineReader Engine 11 for Windows
  13. ^ Classifier - Antidot
  14. ^ "3 Document Classification Methods for Tough Projects". www.bisok.com. Retrieved 2021-08-04.
  15. ^ Stephan Busemann, Sven Schmeier and Roman G. Arens (2000). Message classification in the call center. In Sergei Nirenburg, Douglas Appelt, Fabio Ciravegna and Robert Dale, eds., Proc. 6th Applied Natural Language Processing Conf. (ANLP'00), pp. 158–165, ACL.
  16. ^ Santini, Marina; Rosso, Mark (2008), Testing a Genre-Enabled Application: A Preliminary Assessment (PDF), BCS IRSG Symposium: Future Directions in Information Access, London, UK, pp. 54–63, archived from the original (PDF) on 2019-11-15, retrieved 2011-10-21
  17. ^ X. Dai, M. Bikdash and B. Meyer, "From social media to public health surveillance: Word embedding based clustering method for twitter classification," SoutheastCon 2017, Charlotte, NC, 2017, pp. 1-7. doi:10.1109/SECON.2017.7925400
  18. ^ Krallinger, M; Leitner, F; Rodriguez-Penagos, C; Valencia, A (2008). "Overview of the protein-protein interaction annotation extraction task of BioCreative II". Genome Biology. 9 (Suppl 2): S4. doi:10.1186/gb-2008-9-s2-s4. PMC 2559988. PMID 18834495.

Further reading

  • Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
  • Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, 2010.
