Data Science
Decision Tree Building using Entropy¶
- Decision Tree
- Internal Node -> Tests an attribute
- Branch -> attribute value
- Leaf Node -> Assigns a classification
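As a rough illustration of how entropy drives attribute selection when growing the tree, the sketch below computes entropy and information gain in plain Python; the toy dataset, attribute names, and helper functions are assumptions made for illustration, not part of these notes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    n = len(labels)
    # group labels by the attribute's value (one branch per value)
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(subset) / n * entropy(subset) for subset in branches.values())
    return entropy(labels) - remainder

# toy data: each row is [outlook, windy]; labels say whether to play
rows = [["sunny", "no"], ["sunny", "yes"], ["overcast", "no"], ["rain", "no"], ["rain", "yes"]]
labels = ["no", "no", "yes", "yes", "no"]
print(information_gain(rows, 0, labels))  # gain from splitting on outlook
print(information_gain(rows, 1, labels))  # gain from splitting on windy
```

The attribute with the highest information gain would become the test at the next internal node.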
Data Pre-processing¶
- Data Cleaning
- Data cleaning routines work to “clean” the data by
- filling in missing values
- smoothing noisy data
- identifying or removing outliers
- resolving inconsistencies.
- Data Integration
- Integrating multiple databases, data cubes, or files
- Data Reduction
- Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
- Strategies
- Dimensionality Reduction
- Data encoding schemes are applied to obtain a reduced or “compressed” representation of the original data (see the PCA sketch after this list)
- Compression Techniques -> Wavelet transforms, principal components analysis
- Attribute subset selection -> Removes irrelevant attributes
- Attribute construction -> derives a small set of useful attributes from the original set of attributes
- Numerosity Reduction.
- Parametric models -> regression, log-linear models
- Non-parametric models -> histograms, clustering, sampling, data aggregation
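A minimal sketch of two of the reduction strategies above, assuming scikit-learn and NumPy are available and using a synthetic data matrix: PCA for dimensionality reduction (a “compressed” representation) and simple random sampling as a non-parametric form of numerosity reduction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 tuples, 10 original attributes (synthetic)

# Dimensionality reduction: keep a compressed 3-component representation
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (100, 3)
print(pca.explained_variance_ratio_)     # variance preserved by each component

# Numerosity reduction (non-parametric): represent the data by a random sample
X_sample = X[rng.choice(len(X), size=20, replace=False)]
print(X_sample.shape)                    # (20, 10)
```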
Data Cleaning¶
- Problem -> Missing Values in the data
- Approaches:
- Ignore the tuple (not very effective; usually done when the class label is missing)
- Fill in the missing value manually
- Use a global constant to fill in the missing value (not foolproof; e.g., fill with "\(-\infty\)" or "Unknown")
- Use a measure of central tendency for the attribute to fill in the missing value (mean for symmetric data, median for skewed data)
- Use the attribute mean or median for all samples belonging to the same class as the given tuple
- Use the most probable value to fill in the missing value (regression or decision tree, etc.)
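A small pandas sketch of three of the fill strategies above (global constant, central tendency for the whole attribute, and class-conditional mean); the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30_000, np.nan, 52_000, np.nan, 61_000, 48_000],
    "class":  ["low",  "low",  "high", "high", "high", "low"],
})

# 1. global constant (not foolproof)
filled_const = df["income"].fillna(-1)

# 2. measure of central tendency for the whole attribute
filled_mean   = df["income"].fillna(df["income"].mean())     # symmetric data
filled_median = df["income"].fillna(df["income"].median())   # skewed data

# 3. mean of the samples belonging to the same class as the given tuple
filled_by_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
print(filled_by_class)
```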
- Noisy Data
- Noise is a random error or variance in a measured variable.
- Examples
- Incorrect attribute values
- Duplicate records
- Incomplete data
- Inconsistent data
- Approaches
- Binning (sort values and partition them into equal-frequency bins, then smooth by bin means, medians, or boundaries; see the sketch after this list)
- Regression -> linear / multiple linear regression
- Clustering -> Detect and remove outliers
- Combined computer and human inspection
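A pandas sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the price series and the helper function are assumptions made for illustration.

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# partition into 3 equal-frequency bins
bins = pd.qcut(prices, q=3, labels=False)

# smoothing by bin means: replace every value with the mean of its bin
smoothed_means = prices.groupby(bins).transform("mean")

# smoothing by bin boundaries: snap each value to the closer bin edge
def snap_to_boundary(s):
    lo, hi = s.min(), s.max()
    return s.apply(lambda v: lo if (v - lo) <= (hi - v) else hi)

smoothed_bounds = prices.groupby(bins).transform(snap_to_boundary)
print(pd.DataFrame({"price": prices, "bin": bins,
                    "by_mean": smoothed_means, "by_boundary": smoothed_bounds}))
```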
- Data Cleaning as a Process
- Discrepancy Detection
- Discrepancies are caused by human error in data entry, data decay, deliberate errors, and optional fields
- Approach
- Use metadata describing the data
- Check for field overloading
- Check for (illustrated in the sketch after this list):
- Unique rule -> each value of the attribute must differ from all other values of that attribute
- Consecutive rule -> no missing values between the lowest and highest values of the attribute
- Null rule -> specifies how null (missing) values are represented and how they should be handled
- Data scrubbing tools -> use simple domain knowledge to detect errors and make corrections
- Data auditing tools -> find discrepancies by analysing the data to discover rules and relationships
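A pandas sketch of checking the unique, consecutive, and null rules on a single attribute; the invoice_id column and its values are hypothetical.

```python
import numpy as np
import pandas as pd

ids = pd.Series([101, 102, 102, 105, np.nan], name="invoice_id")

# unique rule: every value of the attribute should differ from all others
duplicates = ids[ids.duplicated(keep=False) & ids.notna()]

# consecutive rule: no gaps between the lowest and highest value
expected = set(range(int(ids.min()), int(ids.max()) + 1))
missing = sorted(expected - set(ids.dropna().astype(int)))

# null rule: agree on how nulls are represented, then count them
null_count = ids.isna().sum()

print("duplicate values:", duplicates.tolist())
print("missing in range:", missing)
print("null values:", null_count)
```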
- Data Migration and Integration
- Data Migration tools allow transformations to be specified
- ETL (Extraction, Transformation, Loading) tools -> allow users to specify transformations through a GUI