Machine Learning Applications for Document Classification

Machine learning is being applied to many difficult problems in advanced analytics. One application of current interest is document classification, where organizing and editing documents is still largely manual. Automating that work requires heavy use of text mining on unstructured data to first parse and categorize information.

Even the earliest applications of Business Intelligence leaned heavily on categorizing unstructured text data. In his 1958 IBM Journal article, “A Business Intelligence System,” H.P. Luhn writes:

“This intelligence system will utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the ‘action points’ in an organization.”

Digital text analytics has been emerging since the 1990s, when Professor Ronen Feldman coined the term “text mining”, and has grown rapidly in recent years. Early applications, which put research into practice, include fraud detection, government intelligence, and bioinformatics. Classifying documents (from books to news articles to blogs to legal papers) into categories with similar themes or topics is critical for their future reference. With the exponential growth in the volume of digital documents, both online and within organizations, automated document classification has become increasingly desirable and necessary within the last decade.

Once a taxonomy for documents has been established, automating the process of assigning uncategorized documents (whether digital or print) into one or more categories is a classic example of supervised learning. This is a machine learning task that assesses each unit that is to be assigned based on its inherent characteristics, and the target is a list of predefined categories, classes, or labels – comprising a set of “right answers” to which an input (here, a text document) can be mapped.

Well-known methods for supervised learning include:

  • Logistic regression, a predictive modeling technique where the outcomes are (typically) binary categories. Propensity models, such as churn, likelihood-to-buy, or customer segments, are great use cases for LR and are an Aspirent specialty.
  • Linear regression, to predict continuous outcomes such as sales volume or customer claims
  • Naïve Bayes, a family of probabilistic classifiers derived from Bayes’ Theorem
  • Other widely used classifiers such as Support Vector Machines, linear discriminant analysis, Decision Trees, K-nearest neighbor, and Artificial Neural Networks. Some of these, such as Decision Trees and K-nearest neighbor, are nonparametric.
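To make the supervised setup concrete, here is a minimal sketch of a Naïve Bayes text classifier written with only the Python standard library. The training sentences, the category names, and the helper names (`tokenize`, `train_naive_bayes`, `classify`) are all hypothetical and chosen for illustration; a production system would use far more data and a tested library rather than this hand-rolled version.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real systems would do much more."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)   # category -> word -> count
    doc_counts = Counter()               # category -> number of documents
    vocabulary = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-posterior score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in sorted(doc_counts):
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical, hand-made training set: the labels are the "right answers".
training_data = [
    ("the team won the championship game", "Sports"),
    ("the striker scored a late goal", "Sports"),
    ("the senate passed the budget bill", "Politics"),
    ("the president vetoed the tax bill", "Politics"),
]

model = train_naive_bayes(training_data)
prediction = classify("the senate debated the tax bill", model)
print(prediction)
```

The unseen headline is mapped to one of the predefined categories purely from word frequencies learned on the labeled examples, which is the essence of the supervised approach described above.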

These methods have a wide range of practical applications, especially in today’s web-based world. Some familiar ones are:

  • News classification (Politics, Sports, Entertainment, Editorial)
  • E-commerce categorization (in Apparel: Women’s, Men’s, Kids’, Shoes, Dresses, Tops, Shorts, Accessories, sizes, styles, colors, etc.)
  • Search result ranking (reinforcement and semi-supervised learning) – most relevant results “on top” based on what others clicked on after searching for similar terms
  • Recommendation engines – Amazon’s “other customers ultimately bought”; “Suggested for you”
  • Spam detection in email filters

In contrast, in unsupervised learning there is no “right answer”. We have no target category or class in which to place a piece of data or a document. Instead, the computer finds natural similarities between documents or data points and creates groupings. The machine “learns” as each new data point is compared to the emerging groupings, and categories are refined iteratively.

The most common unsupervised learning technique is cluster analysis, which we often use to build data-driven market segments for our clients. It has broad applications in text classification as well. Other unsupervised methods include self-organizing maps, principal component and factor analysis (used for statistical variable reduction), probabilistic neural networks, and more.
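As a rough illustration of grouping without labels, the sketch below uses a greedy single-pass clustering: each new document is compared with the groups that have emerged so far and either joins the most similar one or starts a new group. This is a deliberately simple stand-in and not a full cluster analysis such as k-means, which would revisit earlier assignments. The stopword list, the Jaccard similarity measure, the `threshold` value, and the sample documents are all assumptions made for the example.

```python
STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "on", "is"}


def keywords(text):
    """Return the set of lowercase non-stopword tokens in a text."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def jaccard(a, b):
    """Share of keywords two sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_documents(docs, threshold=0.2):
    """Greedy single-pass clustering; no labels are supplied."""
    clusters = []  # each cluster: {"words": keyword set, "members": [indices]}
    for index, text in enumerate(docs):
        words = keywords(text)
        # Compare the new document with every emerging group.
        best = max(clusters, key=lambda c: jaccard(words, c["words"]), default=None)
        if best is not None and jaccard(words, best["words"]) >= threshold:
            best["members"].append(index)
            best["words"] |= words  # the group's profile grows as it absorbs documents
        else:
            clusters.append({"words": set(words), "members": [index]})
    return [c["members"] for c in clusters]


# Hypothetical, unlabeled documents.
documents = [
    "refund policy for online orders",
    "quarterly sales report for europe",
    "online orders refund request form",
    "sales report europe third quarter",
]
groups = cluster_documents(documents)
print(groups)
```

Nothing tells the program that the documents concern refunds or sales; the two groups emerge only from shared vocabulary.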

Cool uses, both fun and practical, for Unsupervised Learning have skyrocketed with the abundance of digital data…

  • Social network conversations
  • Targeted marketing based on geography and online browsing
  • Image recognition/reverse image search
  • Facial recognition
  • Sentiment analysis (NLP)
  • Fraud detection

…and bring us ever closer to true Artificial Intelligence. We have begun our journey in this space, taking machine learning applications to the next level by not only classifying text but also training the machine to understand and interpret the intent that lies beneath it. Sentiment analysis gets part of the way there by labeling content as positive, negative, or neutral, but major gaps remain in understanding tone, context, and relevancy. By training the computer to understand intent, we can teach it not only to categorize documents and their component parts, but also to edit their content to keep them relevant and up to date.
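The gap between sentiment labels and real understanding is easy to see in a toy example. The sketch below labels text by counting words from two tiny word lists; the lists and the function name `label_sentiment` are hypothetical, and real sentiment tools are considerably more sophisticated. Even so, the last call shows the kind of context problem described above.

```python
POSITIVE = {"good", "great", "helpful", "clear"}
NEGATIVE = {"bad", "confusing", "outdated", "wrong"}


def label_sentiment(text):
    """Label text by counting words from two tiny, hypothetical word lists."""
    words = text.lower().replace(".", "").replace(",", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label_sentiment("The new policy is clear and helpful."))
print(label_sentiment("This section is outdated and confusing."))
# Word counting ignores context: a person reads "not bad" as mild praise,
# but the only listed word present is "bad", so the label comes out negative.
print(label_sentiment("The revised wording is not bad."))
```

A word-counting approach has no notion of negation, tone, or what the writer intended, which is why labeling alone only gets part of the way there.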

Automating the process of document editing

Classifying a full, multi-page document is more complex than classifying, say, a comment on a social network or a blog post, because the longer document is more likely to contain a mixture of themes. A many-to-many relationship often exists between documents and classifications. If we think of, say, a corporate policy listing as a set of mini “documents” tagged with metadata, we can start to classify, reference, and change its component parts separately.
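One way to picture this many-to-many relationship is as a pair of lookups: each section of a policy carries zero or more tags, and each tag points back to every section that carries it. The sketch below assumes sections are separated by blank lines and that tags are assigned by simple keyword matching; the tag vocabulary, the sample policy text, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical tag vocabulary: a tag applies if any of its keywords appears.
TAG_KEYWORDS = {
    "travel": {"travel", "flight", "hotel"},
    "expenses": {"expense", "reimbursement", "receipt"},
    "security": {"password", "badge", "laptop"},
}


def split_sections(policy_text):
    """Treat each blank-line-separated block as its own mini document."""
    return [block.strip() for block in policy_text.split("\n\n") if block.strip()]


def tag_section(section):
    """Return every tag whose keywords appear in the section (possibly several)."""
    words = set(section.lower().replace(".", " ").replace(",", " ").split())
    return sorted(tag for tag, keys in TAG_KEYWORDS.items() if words & keys)


policy = """Book each flight and hotel through the travel portal.

Submit a receipt with every travel expense for reimbursement.

Change your laptop password every 90 days."""

sections = split_sections(policy)
section_tags = {i: tag_section(s) for i, s in enumerate(sections)}

# Invert the mapping so each tag points back to all sections that carry it.
tag_index = defaultdict(list)
for section_id, tags in section_tags.items():
    for tag in tags:
        tag_index[tag].append(section_id)

print(section_tags)
print(dict(tag_index))
```

With both mappings in hand, a change to the travel rules can be routed to exactly the sections tagged "travel", even though one of them also belongs to "expenses".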

We have noticed that one area still lacking automation is the editing of official documents as policies change. A rigid, supervised classification structure for documentation may become obsolete over time and need greater fluidity, which calls for a more unsupervised learning approach. Even with recent major digital advances, organizations still employ teams of people to perform the tedious tasks of manually reading, interpreting, and updating documents.

For example, when researchers make a breakthrough in medicine, hundreds or thousands of existing medical texts are affected. This means the text resources of, say, a hospital, medical school, or physicians’ practice could conflict with one another until everything is updated with the new research. By surmounting the machine learning task of understanding the intent and context of a newly documented piece of research, we will be able to automate the updating of all related text resources to include the new findings.
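A plausible first step toward that goal is simply finding which existing texts a new finding touches. The sketch below ranks documents by cosine similarity of their word counts to the new finding and flags those above a cutoff for review. It does not understand intent or rewrite anything, which remains the hard part. The document titles, the placeholder drug name `drugx`, the stopword list, and the `threshold` value are all invented for illustration.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "is", "with", "be", "should"}


def term_counts(text):
    """Bag-of-words counts with punctuation and stopwords removed."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return Counter(w for w in cleaned.split() if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def flag_impacted(new_finding, library, threshold=0.3):
    """Return titles of documents similar enough to the finding to need review."""
    finding_vec = term_counts(new_finding)
    scored = [(cosine(finding_vec, term_counts(text)), title) for title, text in library.items()]
    # Most similar documents first; anything below the cutoff is left alone.
    return [title for score, title in sorted(scored, reverse=True) if score >= threshold]


# Hypothetical library of existing texts and a hypothetical new finding.
library = {
    "dosage_guide": "Recommended drugx dosage for adult patients with hypertension.",
    "cafeteria_menu": "Weekly cafeteria menu and opening hours for staff.",
    "discharge_notes": "Patients with hypertension should continue drugx after discharge.",
}
finding = "New study revises drugx dosage for patients with hypertension."

impacted = flag_impacted(finding, library)
print(impacted)
```

Flagging candidates this way narrows the manual reading load, while the interpretation and the actual edit are still left to people.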

Cracking this nut and automating the process could enable huge advancements in:

  • Local government coding and ordinances
  • Legal documents
  • Healthcare
  • Corporate and government policies
  • Education – Textbooks
  • And much, much more.

The problem, ripe for solving via Machine Learning, has many applications.  Solving it will rely on principles of text classification, layered with supervised and unsupervised machine learning. Subscribe to stay in touch as we continue on this journey!


By: Amanda Hand | aspirent