Data Reduction in Data Mining


In this tutorial, you’ll learn about data reduction in data mining.

Data reduction produces a simplified representation of the original data that is much smaller in size but still closely preserves the information in the original data.

Data reduction methods are described in the sections below.

Data Cube Aggregation:

This approach aggregates data into a more compact, summarized form. As an example, suppose the data you collected for your study covers 2012 to 2014 and records your company’s revenue every three months. If the analysis needs annual sales rather than quarterly figures, we can aggregate the data so that the result gives the total sales per year instead of per quarter. The result is a summary of the original data.
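The aggregation described above can be sketched in a few lines of Python. The revenue figures and the name `aggregate_by_year` are made up for illustration; only the idea of rolling quarterly values up into yearly totals comes from the text.

```python
from collections import defaultdict

# Hypothetical quarterly revenue records: (year, quarter, revenue).
quarterly_sales = [
    (2012, 1, 224.0), (2012, 2, 408.0), (2012, 3, 350.0), (2012, 4, 586.0),
    (2013, 1, 300.0), (2013, 2, 416.0), (2013, 3, 380.0), (2013, 4, 600.0),
    (2014, 1, 310.0), (2014, 2, 450.0), (2014, 3, 400.0), (2014, 4, 650.0),
]

def aggregate_by_year(records):
    """Sum the revenue of all quarters that belong to the same year."""
    totals = defaultdict(float)
    for year, _quarter, revenue in records:
        totals[year] += revenue
    return dict(totals)

# Twelve quarterly rows are reduced to three yearly rows.
annual_sales = aggregate_by_year(quarterly_sales)
```

The quarterly detail is lost after aggregation, so this only suits analyses that work at the yearly level.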

Dimension Reduction:

When some attributes in the data are only weakly relevant, we keep just the attributes needed for our analysis. This shrinks the data by eliminating irrelevant or redundant attributes (features).

Step-wise Forward Selection – We start with an empty set of attributes and, at each step, add the best of the remaining original attributes based on its relevance. In statistics, relevance is often judged with a significance test, using the p-value.

Assume the data set includes the following attributes, with some of them being redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
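The steps above can be sketched as a greedy loop. This is a minimal illustration, not a definitive implementation: the `RELEVANCE` weights and the `toy_score` function (total relevance minus a cost of 3 per attribute kept) are assumptions invented so that the example reproduces the attribute sets shown; a real analysis would score subsets with a statistical test or a model's validation accuracy.

```python
def forward_selection(attributes, score):
    """Start from an empty set and add the best remaining attribute each step."""
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        # Pick the attribute that gives the highest score when added.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining attribute improves the subset
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Hypothetical relevance weights, chosen only for this illustration.
RELEVANCE = {"X1": 9, "X2": 7, "X3": 2, "X4": 1, "X5": 5, "X6": 0}

def toy_score(subset):
    """Assumed score: total relevance minus a cost of 3 per attribute kept."""
    return sum(RELEVANCE[a] for a in subset) - 3 * len(subset)

reduced = forward_selection(["X1", "X2", "X3", "X4", "X5", "X6"], toy_score)
```

With these assumed weights the loop adds X1, then X2, then X5, and stops because adding any further attribute would lower the score.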

Step-wise Backward Selection

This selection begins with the full set of attributes in the original data and eliminates the worst remaining attribute at each step.
Assume the data set includes the following attributes, with some of them being redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: {X1, X2, X3, X4, X5, X6}
Step-1: {X1, X2, X3, X4, X5}
Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
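Backward selection can be sketched in the same greedy style. As before, the `RELEVANCE` weights and `toy_score` (total relevance minus a cost of 3 per attribute kept) are assumptions made up so that the example follows the steps listed above; they are not part of any standard method.

```python
def backward_elimination(attributes, score):
    """Start from the full set and drop the worst remaining attribute each step."""
    selected = list(attributes)
    best_score = score(selected)
    while len(selected) > 1:
        # The worst attribute is the one whose removal leaves the best subset.
        worst = max(selected, key=lambda a: score([x for x in selected if x != a]))
        reduced_score = score([x for x in selected if x != worst])
        if reduced_score < best_score:
            break  # every possible removal makes the subset worse
        selected.remove(worst)
        best_score = reduced_score
    return selected

# Hypothetical relevance weights, chosen only for this illustration.
RELEVANCE = {"X1": 9, "X2": 7, "X3": 2, "X4": 1, "X5": 5, "X6": 0}

def toy_score(subset):
    """Assumed score: total relevance minus a cost of 3 per attribute kept."""
    return sum(RELEVANCE[a] for a in subset) - 3 * len(subset)

reduced = backward_elimination(["X1", "X2", "X3", "X4", "X5", "X6"], toy_score)
```

With these assumed weights the loop removes X6, then X4, then X3, and stops because removing X5 would lower the score.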

Combination of Forward and Backward Selection – At each step, we pick the best attribute and eliminate the worst of the remaining attributes, which saves time and speeds up the process.

Data Compression

Data compression reduces the size of the data using encoding mechanisms such as Huffman encoding and run-length encoding. It can be divided into two types based on the compression technique.

Lossless Compression – Encoding techniques (such as run-length encoding) allow a simple and modest reduction in data size. In lossless compression, the algorithm can recover the exact original data from the compressed data.

Lossy Compression – Examples of this kind of compression include the Discrete Wavelet Transform technique and PCA (principal component analysis). The JPEG image format, for example, uses lossy compression, yet the decompressed image still looks close to the original. In lossy compression, the decompressed data may differ from the original data, but it is still useful for retrieving information.
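As a small illustration of a lossless scheme, here is a minimal run-length encoding sketch. The function names `rle_encode` and `rle_decode` are chosen for this example; the point is that decoding returns exactly the original input.

```python
from itertools import groupby

def rle_encode(text):
    """Replace each run of repeated characters with a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string by repeating each character count times."""
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("AAAABBBCCD")
decoded = rle_decode(encoded)
```

Run-length encoding only pays off when the data contains long runs of repeated values; on data without runs, the encoded form can be larger than the input.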
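To make the idea of lossy compression concrete, the sketch below applies one level of a simplified (unnormalized) Haar wavelet transform and discards small detail coefficients. It assumes an even-length signal, and the threshold value is arbitrary; real wavelet codecs use several levels and more careful quantization.

```python
def haar_forward(signal):
    """One level of an unnormalized Haar transform on an even-length signal."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details

def haar_inverse(averages, details):
    """Rebuild the signal: each pair is (average + detail, average - detail)."""
    signal = []
    for avg, det in zip(averages, details):
        signal.extend([avg + det, avg - det])
    return signal

def lossy_compress(signal, threshold):
    """Zero out detail coefficients smaller than the threshold (the lossy step)."""
    averages, details = haar_forward(signal)
    kept = [d if abs(d) >= threshold else 0.0 for d in details]
    return averages, kept

original = [4.0, 4.0, 10.0, 12.0, 7.0, 1.0]
averages, kept_details = lossy_compress(original, threshold=2.0)
restored = haar_inverse(averages, kept_details)
```

The restored signal differs from the original wherever a small detail coefficient was dropped (here 10.0 and 12.0 both come back as 11.0), while the large jump from 7.0 to 1.0 is preserved.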

Numerosity Reduction:

In this reduction method, the actual data is replaced with a mathematical model or a smaller representation of the data. In parametric methods, only the model parameters are stored (for example, the coefficients of a regression model). Clustering, histograms, and sampling are examples of non-parametric methods.
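As an illustration of storing only model parameters, the sketch below fits a straight line to a handful of points by ordinary least squares and keeps just the slope and intercept. The data values and the name `fit_line` are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: five (x, y) points that lie on a line.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

# Two numbers now stand in for the five stored y values.
slope, intercept = fit_line(xs, ys)
estimated_ys = [slope * x + intercept for x in xs]
```

Real data rarely lies exactly on a line, so the values rebuilt from the parameters are usually approximations of the originals.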

Discretization & Concept Hierarchy Generation

Data discretization techniques divide the range of a continuous attribute into intervals. We then replace the attribute’s many continuous values with a small number of interval labels. This allows mining results to be presented in a clear and concise manner.

Top-down Discretization – Also known as splitting, this divides the entire range of the attribute by first considering one or a few points (so-called breakpoints or split points) and then repeating this method on the resulting intervals until the entire range is divided.

Bottom-up Discretization – Also known as merging, this first considers all the continuous values as potential split points and then discards some of them by merging neighboring values into intervals.

Concept Hierarchies – These decrease the size of the data by collecting and then replacing low-level concepts (such as numeric values for age) with higher-level concepts (categorical values such as middle-aged or senior).
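A concept hierarchy for age can be sketched as a simple mapping from numeric values to labels. The cut-off ages (30 and 60) and the label names are assumptions made for this example; the right boundaries depend on the application.

```python
def age_to_concept(age):
    """Map a numeric age to a higher-level concept (assumed cut-offs: 30 and 60)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [23, 35, 47, 61, 58, 72, 19]
concepts = [age_to_concept(a) for a in ages]

# Seven distinct numeric values collapse into three categories.
distinct_concepts = sorted(set(concepts))
```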

The following techniques can be used for numeric data:

Binning – Binning is the transformation of numerical variables into categorical equivalents. The number of categories is determined by the number of bins the user selects.

Histogram Analysis – The histogram, like binning, is used to divide the values of the attribute X into disjoint ranges called buckets. There are a few partitioning rules to follow:

Partitioning the values so that each bucket holds roughly the same number of occurrences from the data set is known as equal frequency partitioning.

Partitioning the values into ranges of a fixed width, which depends on the number of bins (for example, a range of values from 0 to 20), is known as equal width partitioning.

Clustering – Clustering is the method of grouping similar data values together, so that each group can be treated as one unit.
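The equal width and equal frequency partitioning rules mentioned above can be sketched as follows. The function names and sample values are made up for illustration, and the sketch assumes the values are not all identical (otherwise the bin width would be zero).

```python
def equal_width_bins(values, num_bins):
    """Assign each value to one of num_bins ranges that all have the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    # The maximum value would land in bin num_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

def equal_frequency_bins(values, num_bins):
    """Assign bins by rank so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        labels[index] = rank * num_bins // len(values)
    return labels

values = [0, 2, 4, 5, 9, 10, 15, 20]
width_labels = equal_width_bins(values, 4)
frequency_labels = equal_frequency_bins(values, 4)
```

Equal width bins can end up unevenly filled when the data is skewed, while equal frequency bins keep the counts balanced at the cost of uneven ranges. In this simple version, tied values may fall into different equal frequency bins.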

