Data Science
Decision Tree Building using Entropy¶
- Decision Tree
- Internal Node -> Tests an attribute
- Branch -> attribute value
- Leaf Node -> Assigns a classification
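As a rough illustration of how entropy drives attribute selection when growing the tree, the sketch below computes entropy and information gain in plain Python; the toy dataset, attribute names, and helper functions are assumptions made for illustration, not part of these notes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    n = len(labels)
    # group labels by the attribute's value (one branch per value)
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(subset) / n * entropy(subset) for subset in branches.values())
    return entropy(labels) - remainder

# toy data: each row is [outlook, windy]; labels say whether to play
rows = [["sunny", "no"], ["sunny", "yes"], ["overcast", "no"], ["rain", "no"], ["rain", "yes"]]
labels = ["no", "no", "yes", "yes", "no"]
print(information_gain(rows, 0, labels))  # gain from splitting on outlook
print(information_gain(rows, 1, labels))  # gain from splitting on windy
```

The attribute with the highest information gain would become the test at the next internal node.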
Data Pre-processing¶
- Data Cleaning
- Data cleaning routines work to “clean” the data by
- filling in missing values
- smoothing noisy data
- identifying or removing outliers
- resolving inconsistencies.
- Data Integration
- Integrating multiple databases, data cubes, or files
- Data Reduction
- Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
- Strategies
- Dimensionality Reduction
- Data encoding schemes are applied to obtain a reduced or “compressed” representation of the original data (see the PCA sketch after this list)
- Compression Techniques -> Wavelet transforms, principal components analysis
- Attribute subset selection -> Removes irrelevant attributes
- Attribute construction -> derives a small set of useful attributes from the original set of attributes
- Numerosity Reduction.
- Parametric models -> regression, log-linear models
- Non-parametric models -> histograms, clustering, sampling, data aggregation
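A minimal sketch of two of the reduction strategies above, assuming scikit-learn and NumPy are available and using a synthetic data matrix: PCA for dimensionality reduction (a “compressed” representation) and simple random sampling as a non-parametric form of numerosity reduction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 tuples, 10 original attributes (synthetic)

# Dimensionality reduction: keep a compressed 3-component representation
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (100, 3)
print(pca.explained_variance_ratio_)     # variance preserved by each component

# Numerosity reduction (non-parametric): represent the data by a random sample
X_sample = X[rng.choice(len(X), size=20, replace=False)]
print(X_sample.shape)                    # (20, 10)
```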
Data Cleaning¶
- Problem -> Missing Values in the data
- Approaches:
- Ignore the tuple (not very effective; usually done when the class label is missing)
- Fill in the missing value manually
- Use a global constant to fill in the missing value (not foolproof; e.g., fill with "\(-\infty\)" or "Unknown")
- Use a measure of central tendency for the attribute to fill in the missing value (mean for symmetric data, median for skewed data)
- Use the attribute mean or median for all samples belonging to the same class as the given tuple
- Use the most probable value to fill in the missing value (regression or decision tree, etc.)
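A small pandas sketch of three of the fill strategies above (global constant, central tendency for the whole attribute, and class-conditional mean); the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30_000, np.nan, 52_000, np.nan, 61_000, 48_000],
    "class":  ["low",  "low",  "high", "high", "high", "low"],
})

# 1. global constant (not foolproof)
filled_const = df["income"].fillna(-1)

# 2. measure of central tendency for the whole attribute
filled_mean   = df["income"].fillna(df["income"].mean())     # symmetric data
filled_median = df["income"].fillna(df["income"].median())   # skewed data

# 3. mean of the samples belonging to the same class as the given tuple
filled_by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(filled_by_class)
```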
- Noisy Data
- Noise is a random error or variance in a measured variable.
- Examples
- Incorrect attribute values
- Duplicate records
- Incomplete data
- Inconsistent data
- Approaches
- Binning (sort values and partition them into equal-frequency bins, then smooth by bin means, medians, or boundaries; see the sketch after this list)
- Regression -> linear / multiple linear regression
- Clustering -> Detect and remove outliers
- Combined computer and human inspection
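A pandas sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the price series and the helper function are assumptions made for illustration.

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# partition into 3 equal-frequency bins
bins = pd.qcut(prices, q=3, labels=False)

# smoothing by bin means: replace every value with the mean of its bin
smoothed_means = prices.groupby(bins).transform("mean")

# smoothing by bin boundaries: snap each value to the closer bin edge
def snap_to_boundary(s):
    lo, hi = s.min(), s.max()
    return s.apply(lambda v: lo if (v - lo) <= (hi - v) else hi)

smoothed_bounds = prices.groupby(bins).transform(snap_to_boundary)
print(pd.DataFrame({"price": prices, "bin": bins,
                    "by_mean": smoothed_means, "by_boundary": smoothed_bounds}))
```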
- Data Cleaning as a Process
- Discrepancy Detection
- Discrepancies are caused by human error in data entry, data decay, deliberate errors, and optional fields
- Approach
- Use metadata describing the data
- Check for field overloading
- Check for (illustrated in the sketch after this list):
- Unique rule -> each value of the attribute must differ from all other values of that attribute
- Consecutive rule -> no missing values between the lowest and highest values of the attribute
- Null rule -> specifies how null (missing) values are represented and how they should be handled
- Data scrubbing tools -> use simple domain knowledge to detect errors and make corrections
- Data auditing tools -> find discrepancies by analysing the data to discover rules and relationships
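A pandas sketch of checking the unique, consecutive, and null rules on a single attribute; the invoice_id column and its values are hypothetical.

```python
import numpy as np
import pandas as pd

ids = pd.Series([101, 102, 102, 105, np.nan], name="invoice_id")

# unique rule: every value of the attribute should differ from all others
duplicates = ids[ids.duplicated(keep=False) & ids.notna()]

# consecutive rule: no gaps between the lowest and highest value
expected = set(range(int(ids.min()), int(ids.max()) + 1))
missing = sorted(expected - set(ids.dropna().astype(int)))

# null rule: agree on how nulls are represented, then count them
null_count = ids.isna().sum()

print("duplicate values:", duplicates.tolist())
print("missing in range:", missing)
print("null values:", null_count)
```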
- Data Migration and Integration
- Data Migration tools allow transformations to be specified
- ETL (Extraction, Transformation, Loading) tools -> allow users to specify transformations through a GUI