Categorizing Data

Whenever we speak of big data, we immediately speak about structured, partially structured, and unstructured data. In this post I will examine this broad categorization and attempt to provide a finer description of each category.

I will use the following dimensions to describe the categories:

  • Availability of meta-data that describes the data
  • Semantics of the data
  • Format of the data
  • Content of the data

It is important to understand these characteristics of the data, as this helps in understanding what information we need to get out of the data and how we can utilize the extracted information.

Structured Data

This category of data has well-defined meta-data describing the content, and the content is primarily scalar. I use scalar in a loose sense to mean that the data content can be easily mapped into an appropriately typed data structure and can be manipulated (e.g. aggregated) and visualized. The classic example is relational data. Typically, the semantics of the data content are established from the context of the data, for example sales data, customer data, service data, or order data.

The format of structured data is typically tabular, JSON, or XML.
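To make the "loose scalar" idea concrete, here is a minimal sketch of mapping structured records into a typed data structure and then aggregating them. The `Sale` type, the field names, and the sample values are all hypothetical and chosen only for illustration.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sale:
    # Hypothetical schema: in practice this comes from the data's meta-data.
    region: str
    amount: float


# Hypothetical sales records in JSON form (illustrative values only).
raw = (
    '[{"region": "east", "amount": 120.5},'
    ' {"region": "west", "amount": 80.0},'
    ' {"region": "east", "amount": 30.0}]'
)

# Each record maps directly onto the typed structure.
sales = [Sale(**record) for record in json.loads(raw)]

# A simple aggregation: total amount per region.
totals_by_region = defaultdict(float)
for sale in sales:
    totals_by_region[sale.region] += sale.amount
totals = dict(totals_by_region)

print(totals)
```

Because every field is a scalar with a known type, no interpretation step is needed between loading the data and aggregating it.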

Partially Structured Data

This category of data typically has either explicit or implicit meta-data describing the content, and the content can be a combination of scalar values and free-form text. One example is relational data with a string column that holds a user comment, such as a service request transaction with a column for user comments. Another is log data or machine-generated data, which has implicit meta-data and whose content typically combines scalar values (for example, IP address and date) with complex text strings that encode multiple pieces of key information (for example, browser and OS info). Typically, the semantics of the data content are established from the context of the data.

The format of partially structured data is mostly tabular, with JSON and XML in the mix.
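As a rough sketch of the log data case, the snippet below pulls the scalar values (IP address, date) out of a log line and then makes a crude attempt at decoding the browser and OS from the embedded text string. The log layout, the `parse_line` helper, and the sample line are assumptions made for this example; real log formats and user-agent strings vary widely and usually need a dedicated parser.

```python
import re
from datetime import datetime

# Hypothetical log layout: <ip> [<timestamp>] "<user agent>"
LINE = '203.0.113.7 [2015-03-14 09:26:53] "Mozilla/5.0 (Windows NT 6.1) Firefox/36.0"'

PATTERN = re.compile(r'^(?P<ip>\S+) \[(?P<ts>[^\]]+)\] "(?P<agent>[^"]*)"$')


def parse_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    agent = match.group("agent")
    # Crude heuristics: OS is the first parenthesised group,
    # browser is the last "name/version" token in the string.
    os_match = re.search(r"\(([^)]*)\)", agent)
    browser_match = re.search(r"(\w+)/[\d.]+$", agent)
    return {
        "ip": match.group("ip"),
        "timestamp": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "os": os_match.group(1) if os_match else None,
        "browser": browser_match.group(1) if browser_match else None,
    }


record = parse_line(LINE)
print(record)
```

The meta-data here is implicit: nothing in the line names its fields, so the pattern itself encodes what we know about the layout.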

Unstructured Data

This category of data either has meta-data (implicit or explicit) describing content that can be complex (for example, a Word or PDF document, video, or music), or has no meta-data at all (for example, a plain text file or music).

In the former case, the content is mainly non-scalar and can include text, binary, semi-structured, and structured content. This is the most complex of all the categories, with the data having a complex format. If there is meta-data about the unstructured data, the first step is to use it to extract the sub-categories and then focus on each sub-category. If there is no meta-data, the extraction will require proprietary knowledge of the data and can be complex.
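The "use the meta-data first" step can be sketched as follows. I assume, purely for illustration, that a compound document exposes a list of parts, each with a declared content type; the `parts` list and the `group_by_subcategory` helper are hypothetical and not tied to any particular document format.

```python
from collections import defaultdict

# Hypothetical meta-data for one compound document (illustrative only).
parts = [
    {"name": "body", "content_type": "text/plain"},
    {"name": "figure1", "content_type": "image/png"},
    {"name": "appendix_table", "content_type": "text/csv"},
    {"name": "attachment", "content_type": None},  # no meta-data available
]


def group_by_subcategory(parts):
    """Group part names by the top-level type declared in their meta-data."""
    groups = defaultdict(list)
    for part in parts:
        content_type = part.get("content_type")
        key = content_type.split("/", 1)[0] if content_type else "unknown"
        groups[key].append(part["name"])
    return dict(groups)


groups = group_by_subcategory(parts)
print(groups)
```

Each group can then be handed to processing suited to that sub-category, while the "unknown" group is the part that needs knowledge of the data itself.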

In the latter case, the content is typically small to large volumes of text, which require manual interpretation or NLP to extract information.

Understanding the categories of data and their sources has a profound impact on preparing the data for downstream analysis.

About atiru

Product Strategist and architect for harnessing value from data.
This entry was posted in Data Management and Analytics.
