Dataset of adult income
Dataset Overview

  • Each row is labelled with a salary of either ">50K" or "<=50K".
  • The dataset is split into two CSV files, adult-training.csv and adult-test.csv.
  • The goal is to build a binary classifier on the training dataset that predicts the column income_bracket, which has two possible values, ">50K" and "<=50K", and to evaluate the classifier's accuracy on the test dataset.
  • categorical_columns = [workclass, education, marital_status, occupation, relationship, race, gender, native_country]
  • continuous_columns = [age, education_num, capital_gain, capital_loss, hours_per_week]
  • A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
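
The extraction conditions in the last bullet translate directly into a boolean filter in pandas. The sketch below is only an illustration: the column names AAGE, AGI, AFNLWGT and HRSWK are taken from the condition itself, while the rows are invented toy values rather than real census records.

```python
import pandas as pd

# Toy census-like records; the values are invented for illustration only.
raw = pd.DataFrame({
    "AAGE":    [15, 25, 40, 62, 33],
    "AGI":     [500, 50, 30000, 45000, 20000],
    "AFNLWGT": [1200, 900, 0, 1500, 2000],
    "HRSWK":   [10, 40, 40, 0, 38],
})

# The same four conditions as in the bullet above, combined element-wise with &.
mask = (
    (raw["AAGE"] > 16)
    & (raw["AGI"] > 100)
    & (raw["AFNLWGT"] > 1)
    & (raw["HRSWK"] > 0)
)
clean = raw[mask]
print(clean)
```

Each of the first four toy rows violates exactly one condition, so only the last row survives the filter.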

The prediction task is to determine whether a person makes over 50K a year.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

train_file = "adult-training.csv"

columns = ['Age','Workclass','fnlgwt','Education','Education_num','Marital_Status',
           'Occupation','Relationship','Race','Sex','Capital_Gain','Capital_Loss',
           'Hours/Week','Native_country','Income']

#collapse-hide

train = pd.read_csv(train_file, names=columns)
train.head()
Age Workclass fnlgwt Education Education_num Marital_Status Occupation Relationship Race Sex Capital_Gain Capital_Loss Hours/Week Native_country Income
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Age             32561 non-null  int64 
 1   Workclass       32561 non-null  object
 2   fnlgwt          32561 non-null  int64 
 3   Education       32561 non-null  object
 4   Education_num   32561 non-null  int64 
 5   Marital_Status  32561 non-null  object
 6   Occupation      32561 non-null  object
 7   Relationship    32561 non-null  object
 8   Race            32561 non-null  object
 9   Sex             32561 non-null  object
 10  Capital_Gain    32561 non-null  int64 
 11  Capital_Loss    32561 non-null  int64 
 12  Hours/Week      32561 non-null  int64 
 13  Native_country  32561 non-null  object
 14  Income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
train.shape
(32561, 15)
train.describe()
Age fnlgwt Education_num Capital_Gain Capital_Loss Hours/Week
count 32561.000000 3.256100e+04 32561.000000 32561.000000 32561.000000 32561.000000
mean 38.581647 1.897784e+05 10.080679 1077.648844 87.303830 40.437456
std 13.640433 1.055500e+05 2.572720 7385.292085 402.960219 12.347429
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.178270e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.783560e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.370510e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.484705e+06 16.000000 99999.000000 4356.000000 99.000000
# Replacing '?' with nan 
train.replace(' ?', np.nan, inplace=True)
train.isnull().sum()
Age                  0
Workclass         1836
fnlgwt               0
Education            0
Education_num        0
Marital_Status       0
Occupation        1843
Relationship         0
Race                 0
Sex                  0
Capital_Gain         0
Capital_Loss         0
Hours/Week           0
Native_country     583
Income               0
dtype: int64

Missing Data:

Workclass (1836), Occupation (1843), Native_country (583)

Important: All of the missing data belongs to categorical columns.
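
As an alternative to replacing ' ?' after loading, the missing markers can possibly be handled at read time. The sketch below assumes the raw file separates fields with a comma followed by a space, as the ' ?' value above suggests; the three rows are made up, and `na_values` and `skipinitialspace` are standard `pd.read_csv` parameters.

```python
import io
import pandas as pd

# Three made-up rows in the same "comma + space" layout as the raw file.
sample = io.StringIO(
    "39, State-gov, Adm-clerical, United-States\n"
    "54, ?, ?, United-States\n"
    "32, Private, Sales, ?\n"
)
cols = ["Age", "Workclass", "Occupation", "Native_country"]

# skipinitialspace drops the blank after each comma, so '?' can be matched as NA.
df = pd.read_csv(sample, names=cols, na_values="?", skipinitialspace=True)

missing = df.isnull().sum()
print(missing)

# Columns that contain missing values, with their dtypes (all text columns here).
print(df.dtypes[missing > 0])
```

A side effect of this approach is that the category strings lose their leading space, so later comparisons would use 'Private' instead of ' Private'.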
train['Income'].value_counts()
 <=50K    24720
 >50K      7841
Name: Income, dtype: int64
sns.countplot(x='Income', data=train)
plt.title("Count of Income Category")
plt.show()

Warning: The dataset is imbalanced, with <=50K as the majority class label.
  • 75.92% of data points are labeled <=50K
  • 24.08% of data points are labeled >50K
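
The class shares follow directly from the value counts printed above. The short sketch below recomputes them and also shows the accuracy a trivial "always predict the majority class" model would reach, which is a useful reference point when judging a classifier on imbalanced data. The baseline idea is an addition for illustration, not part of the original analysis.

```python
# Class counts copied from train['Income'].value_counts() above.
counts = {"<=50K": 24720, ">50K": 7841}
total = sum(counts.values())

shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.2%}")

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(shares.values())
print(f"majority-class baseline: {baseline_accuracy:.2%}")
```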
# Gender distribution
sns.countplot(x='Sex', data=train)
plt.title("Count of Sex Category")
plt.show()
sns.stripplot(x='Sex', y='Hours/Week', data=train,hue='Income',marker='X')
# Workclass
wclass_plot = sns.countplot(x='Workclass', data=train)
wclass_plot.set_xticklabels( wclass_plot.get_xticklabels(),rotation=50, ha="right")
plt.title("Count Plot of Workclass")
Text(0.5, 1.0, 'Count Plot of Workclass')

The Private workclass has the highest count overall.

train['Education'].value_counts()
 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: Education, dtype: int64
# Occupation
occ_plot = sns.countplot(x='Occupation', data=train)
occ_plot.set_xticklabels(occ_plot.get_xticklabels(), rotation=40, ha="right")
plt.title("Count Plot of Occupation")
Text(0.5, 1.0, 'Count Plot of Occupation')
fig, axs = plt.subplots(ncols=2, nrows=4, figsize=(20, 20))
plt.subplots_adjust(hspace=0.68)
fig.delaxes(axs[3][1])  # only seven plots are needed, so drop the eighth axis
fig.suptitle('Subplot of Various Categorical Variables')

# Workclass
wc_plot = sns.countplot(x='Workclass', data=train, ax=axs[0][0])
wc_plot.set_xticklabels(wc_plot.get_xticklabels(), rotation=40, ha="right")
axs[0][0].title.set_text('Count Plot of Workclass')

# Native country
nc_plot = sns.countplot(x='Native_country', data=train, ax=axs[0][1])
nc_plot.set_xticklabels(nc_plot.get_xticklabels(), rotation=72, ha="right")
axs[0][1].title.set_text('Count Plot of Native_country')

# Education
ed_plot = sns.countplot(x='Education', data=train, ax=axs[1][0])
ed_plot.set_xticklabels(ed_plot.get_xticklabels(), rotation=40, ha="right")
axs[1][0].title.set_text('Count Plot of Education')

# Marital status
ms_plot = sns.countplot(x='Marital_Status', data=train, ax=axs[1][1])
ms_plot.set_xticklabels(ms_plot.get_xticklabels(), rotation=40, ha="right")
axs[1][1].title.set_text('Count Plot of Marital Status')

# Relationship
rel_plot = sns.countplot(x='Relationship', data=train, ax=axs[2][0])
rel_plot.set_xticklabels(rel_plot.get_xticklabels(), rotation=40, ha="right")
axs[2][0].title.set_text('Count Plot of Relationship')

# Race
race_plot = sns.countplot(x='Race', data=train, ax=axs[2][1])
race_plot.set_xticklabels(race_plot.get_xticklabels(), rotation=40, ha="right")
axs[2][1].title.set_text('Count Plot of Race')

# Occupation
occ_plot = sns.countplot(x='Occupation', data=train, ax=axs[3][0])
occ_plot.set_xticklabels(occ_plot.get_xticklabels(), rotation=40, ha="right")
axs[3][0].title.set_text('Count Plot of Occupation')

Note: Majority count aggregation in each column:
  • Workclass:
    • Private : 22696
  • Native_country:
    • United-States : 29170
  • Education:
    • HS-grad : 10501
  • Marital_Status:
    • Married-civ-spouse : 14976
  • Relationship:
    • Husband : 13193
  • Race:
    • White : 27816
  • Occupation:
    • Prof-specialty : 4140

Note: Minority count aggregation in each column:
  • Workclass:
    • Never-worked : 7
  • Native_country:
    • Holand-Netherlands : 1
  • Education:
    • Preschool : 51
  • Marital_Status:
    • Married-AF-spouse : 23
  • Relationship:
    • Other-relative : 981
  • Race:
    • Other : 271
  • Occupation:
    • Armed-Forces : 9
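
The majority and minority categories listed above can also be collected programmatically rather than read off the plots. The following is a minimal sketch on an invented toy frame (only the column names match the notebook); it relies on `value_counts()` returning categories sorted from most to least frequent.

```python
import pandas as pd

# Small invented sample; only the column names come from the notebook.
toy = pd.DataFrame({
    "Workclass": ["Private", "Private", "Private",
                  "State-gov", "Never-worked", "State-gov"],
    "Race": ["White", "White", "Black", "White", "Other", "Black"],
})

summary = {}
for col in toy.columns:
    vc = toy[col].value_counts()  # sorted from most to least frequent
    summary[col] = {
        "majority": (vc.index[0], int(vc.iloc[0])),
        "minority": (vc.index[-1], int(vc.iloc[-1])),
    }

for col, info in summary.items():
    print(col, info)
```

On the real data the same loop could run over the categorical columns of `train`, although ties between equally rare categories would need a deliberate tie-breaking rule.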
plt.figure(figsize=(20, 6))
sns.countplot(x='Marital_Status', hue='Income', data=train)
plt.title("Count Plot of Marital Status with Hue Income")
Text(0.5, 1.0, 'Count Plot of Marital Status with Hue Income')

Most people in the Never-married category have an income of <=50K.

plt.figure(figsize=(20, 6))
sns.countplot(x='Relationship', hue='Income', data=train)
plt.title("Count Plot of Relationship with Hue Income")
Text(0.5, 1.0, 'Count Plot of Relationship with Hue Income')
plt.figure(figsize=(20, 6))
sns.countplot(x='Age', hue='Income', data=train)
plt.title("Count Plot of Age with Hue Income")
Text(0.5, 1.0, 'Count Plot of Age with Hue Income')
sns.set_style("whitegrid")
sns.pairplot(train, hue="Income", height=3)
plt.show()

#collapse-hide

# Age with Income
sns.FacetGrid(train, hue="Income", height=6) \
   .map(sns.distplot, "Age") \
   .add_legend();
plt.show();

# Education_num with Income
sns.FacetGrid(train, hue="Income", height=6) \
   .map(sns.distplot, "Education_num") \
   .add_legend();
plt.show();

# Capital_Gain with Income
sns.FacetGrid(train, hue="Income", height=7) \
   .map(sns.distplot, "Capital_Gain") \
   .add_legend();
plt.show();

# Capital_Loss with Income
sns.FacetGrid(train, hue="Income", height=7) \
   .map(sns.distplot, "Capital_Loss") \
   .add_legend();
plt.show();

# Hours/Week with Income
sns.FacetGrid(train, hue="Income", height=7) \
   .map(sns.distplot, "Hours/Week") \
   .add_legend();
plt.show();

[Report] Univariate Analysis

The dataset is imbalanced, with <=50K as the majority class label.

  • 75.92% of data points are labeled <=50K
  • 24.08% of data points are labeled >50K

Missing Data:

Workclass (1836), Occupation (1843), Native_country (583)

All of the missing data belongs to categorical columns.

  1. Workclass
    • Majority:
      • Private Class, 22696
    • Minority:
      • Never-worked, 7
      • Without-pay, 14
      • Federal-gov, 960
  2. Native Country
    • Majority:
      • United-States, 29170
    • Minority:
      • Holand-Netherlands, 1
      • Scotland, 12
    • Missing Data:
      • ?, 583
  3. Education
    • Majority:
      • HS-grad, 10501
      • Some-college, 7291
      • Bachelors, 5355
    • Minority:
      • Preschool, 51
      • 1st-4th, 168
      • 5th-6th, 333
  4. Marital Status
    • Majority:
      • Married-civ-spouse, 14976
      • Never-married, 10683
      • Divorced, 4443
    • Minority:
      • Married-AF-spouse, 23
      • Married-spouse-absent, 418
  5. Relationship
    • Majority:
      • Husband, 13193
      • Not-in-family, 8305
    • Minority:
      • Other-relative, 981
      • Wife, 1568
  6. Race
    • Majority:
      • White, 27816
      • Black, 3124
    • Minority:
      • Other, 271
      • Amer-Indian-Eskimo, 311
  7. Occupation
    • Majority:
      • Prof-specialty, 4140
      • Craft-repair, 4099
      • Exec-managerial, 4066
    • Minority:
      • Armed-Forces, 9
      • Priv-house-serv, 149
      • Protective-serv, 649
    • Missing Data:
      • ?, 1843

Majority count aggregation in each column:

  • Workclass:
    • Private : 22696
  • Native_country:
    • United-States : 29170
  • Education:
    • HS-grad : 10501
  • Marital_Status:
    • Married-civ-spouse : 14976
  • Relationship
    • Husband : 13193
  • Race
    • White : 27816
  • Occupation:
    • Prof-specialty : 4140

Minority count aggregation in each column:

  • Workclass:
    • Never-worked : 7
  • Native_country:
    • Holand-Netherlands : 1
  • Education:
    • Preschool : 51
  • Marital_Status:
    • Married-AF-spouse : 23
  • Relationship
    • Other-relative : 981
  • Race:
    • Other : 271
  • Occupation:
    • Armed-Forces : 9
train_df = pd.read_csv("adult-training.csv", names=columns)
# Replacing '?' with nan
#train_df.replace(' ?', np.nan, inplace=True)

Bivariate Analysis

Questions:

  • Which workclass is most likely to earn >50K?
  • Which education level is most likely to earn >50K?
  • Which marital status category is most likely to earn >50K?
  • Which occupation category is most likely to earn >50K?
  • Which relationship category is most likely to earn >50K?
  • Which gender is most likely to earn >50K?
  • Which race is most likely to earn >50K?
  • Which native country is most likely to earn >50K?

Income

Converting Income into 0s and 1s (1 for >50K, 0 otherwise)

train_df['Income'] = train_df['Income'].apply(lambda x: 1 if x==' >50K' else 0)
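
The comparison above depends on the leading space in ' >50K'. The sketch below, on a few invented labels, illustrates how a comparison without that space silently produces all zeros, and how stripping whitespace first avoids the problem. It is one possible variant, not the notebook's own encoding step.

```python
import pandas as pd

# Invented labels that mimic the leading space found in the raw file.
income = pd.Series([" <=50K", " >50K", " <=50K", " >50K"])

# Comparing without the leading space silently matches nothing.
naive = (income == ">50K").astype(int)

# Stripping whitespace first makes the comparison robust.
encoded = (income.str.strip() == ">50K").astype(int)

print(naive.tolist())
print(encoded.tolist())
```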

Workclass

Replacing NaNs with ' 0'

# Note: the ' ?' -> NaN replacement is commented out for train_df above,
# so this fillna finds no NaNs and ' ?' remains its own Workclass category.
train_df['Workclass'].fillna(' 0', inplace=True)
sns.catplot(x="Workclass", y="Income", data=train_df, kind="bar", height=6,
            palette="muted")
plt.xticks(rotation=45);
plt.title("Bar plot of Work Class VS Income")
Text(0.5, 1, 'Bar plot of Work Class VS Income')

Self-emp-inc has the highest share of people earning >50K.
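
Because Income is now coded as 0/1, the height of each bar in these plots is the mean of Income within the category, which is the fraction of people in that category earning >50K. The sketch below reproduces that quantity with a plain `groupby` on invented toy rows.

```python
import pandas as pd

# Invented rows; Income is already encoded as 0/1 as in the step above.
toy = pd.DataFrame({
    "Workclass": ["Private", "Private", "Private", "Private",
                  "Self-emp-inc", "Self-emp-inc"],
    "Income":    [0, 0, 0, 1, 1, 1],
})

# Mean of a 0/1 column per group = share of rows with Income == 1.
rate = toy.groupby("Workclass")["Income"].mean().sort_values(ascending=False)
print(rate)
```

Sorting the result makes the "highest share" category the first entry, which is easier to read than comparing bar heights by eye.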

Education

sns.catplot(x="Education", y="Income", data=train_df, kind="bar", height=7,
            palette="muted")
plt.xticks(rotation=60);
plt.title("Bar plot of Education VS Income")
Text(0.5, 1, 'Bar plot of Education VS Income')

All the grade-level Education values (1st-4th through 12th) can be combined into a single category, Primary.
ref: https://www.kaggle.com/kost13/us-income-logistic-regression/comments

def primary(x):
    if x in [' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th', ' 12th']:
        return ' Primary'
    else:
        return x
train_df['Education'] = train_df['Education'].apply(primary)
sns.catplot(x="Education", y="Income", data=train_df, kind="bar", height=7,
            palette="muted")
plt.xticks(rotation=60);

Combined [' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th', ' 12th'] into a single category, Primary.

  • Doctorate and Prof-school have the highest share of people earning >50K.
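
The same merge of grade levels can be written without a helper function. This is a sketch of a vectorised alternative using `isin` and `where` on a few invented values; the list of grade labels is the one used above.

```python
import pandas as pd

grades = [" 1st-4th", " 5th-6th", " 7th-8th", " 9th", " 10th", " 11th", " 12th"]

# Invented sample of Education values, including the leading space.
education = pd.Series([" Bachelors", " 11th", " HS-grad", " 7th-8th", " Preschool"])

# Keep values that are not grade levels; replace the rest with ' Primary'.
merged = education.where(~education.isin(grades), " Primary")
print(merged.tolist())
```

Note that ' Preschool' is not in the list, so it stays a separate category, as in the pivot tables later in the notebook.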

Education num

sns.catplot(x="Education_num", y="Income", data=train_df, kind="bar", height=6,
            palette="muted")
plt.xticks(rotation=60);
plt.title("Factorplot of Education VS Income")
Text(0.5, 1, 'Factorplot of Education VS Income')

A higher Education_num is associated with a higher share of people earning >50K.

Marital Status

sns.catplot(x="Marital_Status", y="Income", data=train_df, kind="bar", height=5,
            palette="muted")
plt.xticks(rotation=60);

print(train_df['Marital_Status'].value_counts())
plt.title("Factor plot of Marital Status VS Income")
 Married-civ-spouse       14976
 Never-married            10683
 Divorced                  4443
 Separated                 1025
 Widowed                    993
 Married-spouse-absent      418
 Married-AF-spouse           23
Name: Marital_Status, dtype: int64
Text(0.5, 1, 'Factor plot of Marital Status VS Income')

Married-civ-spouse has the highest share of people earning >50K.

Occupation

# Filling '?' in Occupation with ' 0'
train_df['Occupation'].replace(' ?', ' 0', inplace=True)
train_df['Occupation'].value_counts()
 Prof-specialty       4140
 Craft-repair         4099
 Exec-managerial      4066
 Adm-clerical         3770
 Sales                3650
 Other-service        3295
 Machine-op-inspct    2002
 0                    1843
 Transport-moving     1597
 Handlers-cleaners    1370
 Farming-fishing       994
 Tech-support          928
 Protective-serv       649
 Priv-house-serv       149
 Armed-Forces            9
Name: Occupation, dtype: int64
sns.catplot(x="Occupation", y="Income", data=train_df, kind="bar", height=8,
            palette="muted")
plt.xticks(rotation=60);
plt.title("Factor plot of Occupation VS Income")
Text(0.5, 1, 'Factor plot of Occupation VS Income')

The Exec-managerial occupation has the highest share of people earning >50K.

Relationship

sns.catplot(x="Relationship", y="Income", data=train_df, height=5, kind="bar",
            palette="muted")
plt.xticks(rotation=60)
plt.title("Factorplot of Relationship vs Income")
print(train_df['Relationship'].value_counts())
 Husband           13193
 Not-in-family      8305
 Own-child          5068
 Unmarried          3446
 Wife               1568
 Other-relative      981
Name: Relationship, dtype: int64

The Wife relationship category has the highest share of people earning >50K.

Race

sns.catplot(x="Race", y="Income", data=train_df, height=5, kind="bar",
            palette="muted")
plt.xticks(rotation=60)
plt.title("Factorplot of Race VS Income")
print(train_df['Race'].value_counts())
 White                 27816
 Black                  3124
 Asian-Pac-Islander     1039
 Amer-Indian-Eskimo      311
 Other                   271
Name: Race, dtype: int64

Among the Race categories, Asian-Pac-Islander has the highest share of people earning >50K.

Sex

sns.catplot(x="Sex", y="Income", data=train_df, height=5, kind="bar",
            palette="muted")
plt.xticks(rotation=60)
plt.title("Factorplot of Sex VS Income")

print(train_df['Sex'].value_counts())
 Male      21790
 Female    10771
Name: Sex, dtype: int64

Males have a higher share of people earning >50K than females.

Native country

There are 583 unknown values, which are replaced with ' 0'.

train_df['Native_country'].replace(' ?', ' 0', inplace=True)

#collapse-hide

sns.catplot(x="Native_country", y="Income", data=train_df, height=13, kind="bar",
            palette="muted")
plt.xticks(rotation=80)

print(train_df['Native_country'].value_counts())
 United-States                 29170
 Mexico                          643
 0                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 France                           29
 Greece                           29
 Ecuador                          28
 Ireland                          24
 Hong                             20
 Cambodia                         19
 Trinadad&Tobago                  19
 Thailand                         18
 Laos                             18
 Yugoslavia                       16
 Outlying-US(Guam-USVI-etc)       14
 Hungary                          13
 Honduras                         13
 Scotland                         12
 Holand-Netherlands                1
Name: Native_country, dtype: int64
train_df.columns
Index(['Age', 'Workclass', 'fnlgwt', 'Education', 'Education_num',
       'Marital_Status', 'Occupation', 'Relationship', 'Race', 'Sex',
       'Capital_Gain', 'Capital_Loss', 'Hours/Week', 'Native_country',
       'Income'],
      dtype='object')

Iran has the highest share of people earning >50K.
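
Rates for small groups deserve some caution: in the counts above Iran has only 43 rows and Holand-Netherlands a single one, so their shares rest on very little data. The sketch below shows one possible way to report the share together with the group size and to filter on a minimum size. The toy data, the country labels "A" and "B", and the threshold `min_rows` are all hypothetical.

```python
import pandas as pd

# Invented data: country "A" has 2 rows, country "B" has 8 rows.
toy = pd.DataFrame({
    "Native_country": ["A"] * 2 + ["B"] * 8,
    "Income":         [1, 1] + [1, 1, 1, 0, 0, 0, 0, 0],
})

# Share of Income == 1 and number of rows per country.
stats = toy.groupby("Native_country")["Income"].agg(rate="mean", n="count")
print(stats)

min_rows = 5  # hypothetical threshold, chosen only for this illustration
reliable = stats[stats["n"] >= min_rows].sort_values("rate", ascending=False)
print(reliable)
```

Here "A" has the higher rate but is dropped by the size filter, which is the kind of effect worth checking before ranking countries.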

colormap = plt.cm.magma
plt.figure(figsize=(16,16))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train_df.corr(numeric_only=True), linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

[Bivariate][Report] Answers from the bivariate analysis

  • Which workclass is most likely to earn >50K?
    • Self-emp-inc
  • Which education level is most likely to earn >50K?
    • Doctorate and Prof-school
  • Which marital status category is most likely to earn >50K?
    • Married-civ-spouse
  • Which occupation category is most likely to earn >50K?
    • Exec-managerial
  • Which relationship category is most likely to earn >50K?
    • Wife
  • Which gender is most likely to earn >50K?
    • Male
  • Which race is most likely to earn >50K?
    • Asian-Pac-Islander
  • Which native country is most likely to earn >50K?
    • Iran

Multivariate Analysis, pivoting

Questions:

  • What are the counts for each combination of workclass and education level, on an income basis?
  • What are the counts for each combination of workclass and education level, on a gender basis?
train_mult_index = train_df.set_index(keys = ['Income','Education','Native_country']).sort_index()
train_mult_index.tail()
Age Workclass fnlgwt Education_num Marital_Status Occupation Relationship Race Sex Capital_Gain Capital_Loss Hours/Week
Income Education Native_country
1 Some-college United-States 30 Self-emp-not-inc 176185 10 Married-spouse-absent Craft-repair Own-child White Male 0 0 60
United-States 53 Private 304504 10 Married-civ-spouse Transport-moving Husband White Male 0 1887 45
United-States 46 Private 42251 10 Married-civ-spouse Sales Husband White Male 0 0 45
United-States 46 Private 364548 10 Married-civ-spouse Exec-managerial Husband White Male 0 0 48
Yugoslavia 36 Self-emp-inc 337778 10 Married-civ-spouse Exec-managerial Husband White Male 0 0 60
train_mult_index.loc[(1, " Primary", " United-States"),].count()[0]
202

# People having Income >50K with Primary Education in United-States: 202
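
The same kind of count can be obtained without building a MultiIndex, by combining three boolean conditions. This is a sketch on invented rows; the column names and the leading-space labels follow the notebook.

```python
import pandas as pd

# Invented rows with the same three key columns used in the index above.
toy = pd.DataFrame({
    "Income": [1, 1, 0, 1],
    "Education": [" Primary", " Primary", " Primary", " Bachelors"],
    "Native_country": [" United-States", " United-States",
                       " United-States", " Cuba"],
})

mask = (
    (toy["Income"] == 1)
    & (toy["Education"] == " Primary")
    & (toy["Native_country"] == " United-States")
)

# Summing a boolean mask counts the rows where all conditions hold.
n_rows = int(mask.sum())
print(n_rows)
```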

train_mult_index.stack().to_frame()
0
Income Education Native_country
0 Assoc-acdm ? Age 42
Workclass Self-emp-not-inc
fnlgwt 183765
Education_num 12
Marital_Status Married-civ-spouse
... ... ... ... ...
1 Some-college Yugoslavia Race White
Sex Male
Capital_Gain 0
Capital_Loss 0
Hours/Week 60

390732 rows × 1 columns

#collapse-hide

train_df
Age Workclass fnlgwt Education Education_num Marital_Status Occupation Relationship Race Sex Capital_Gain Capital_Loss Hours/Week Native_country Income
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States 0
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States 0
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States 0
3 53 Private 234721 Primary 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States 0
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32556 27 Private 257302 Assoc-acdm 12 Married-civ-spouse Tech-support Wife White Female 0 0 38 United-States 0
32557 40 Private 154374 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States 1
32558 58 Private 151910 HS-grad 9 Widowed Adm-clerical Unmarried White Female 0 0 40 United-States 0
32559 22 Private 201490 HS-grad 9 Never-married Adm-clerical Own-child White Male 0 0 20 United-States 0
32560 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States 1

32561 rows × 15 columns

iec_data = train_df.loc[:,("Income", "Education", "Workclass")]
iec_data
Income Education Workclass
0 0 Bachelors State-gov
1 0 Bachelors Self-emp-not-inc
2 0 HS-grad Private
3 0 Primary Private
4 0 Bachelors Private
... ... ... ...
32556 0 Assoc-acdm Private
32557 1 HS-grad Private
32558 0 HS-grad Private
32559 0 HS-grad Private
32560 1 HS-grad Self-emp-inc

32561 rows × 3 columns

iec_data.pivot_table(values='Income', index='Education', aggfunc='count', margins_name='Income')
Income
Education
Assoc-acdm 1067
Assoc-voc 1382
Bachelors 5355
Doctorate 413
HS-grad 10501
Masters 1723
Preschool 51
Primary 4202
Prof-school 576
Some-college 7291
iec_data[iec_data.Income == 1].pivot_table(values='Income', index='Education', aggfunc='count', margins_name='Income')
Income
Education
Assoc-acdm 265
Assoc-voc 361
Bachelors 2221
Doctorate 306
HS-grad 1675
Masters 959
Primary 244
Prof-school 423
Some-college 1387
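
Filtering on Income == 1 before pivoting drops any category that has no high earners, which is why Preschool is absent from the second table. As a sketch of an alternative, `pd.crosstab` keeps such categories with a zero count and can also return row-wise shares. The toy rows below are invented.

```python
import pandas as pd

# Invented rows; Preschool has no Income == 1 entries on purpose.
toy = pd.DataFrame({
    "Education": [" Bachelors", " Bachelors", " HS-grad",
                  " HS-grad", " HS-grad", " Preschool"],
    "Income":    [1, 0, 1, 0, 0, 0],
})

# Counts per education level and income class (columns are 0 and 1).
counts = pd.crosstab(toy["Education"], toy["Income"])
print(counts)

# Row-normalised version: share of each income class within an education level.
shares = pd.crosstab(toy["Education"], toy["Income"], normalize="index")
print(shares)
```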

#collapse-hide

iec_data_pivot = iec_data[iec_data.Income == 1].pivot_table(values='Income', index='Education', aggfunc='count', margins_name='Income')

plt.figure(figsize=(16, 8))
sns.heatmap(iec_data_pivot, annot=True, fmt='.1f', cbar_kws= {'label':'Income range in categories'}, cmap='coolwarm')
plt.title('Incomes of various educated categories in Income wise')
Text(0.5, 1, 'Incomes of various educated categories in Income wise')

#collapse-hide

train_df[train_df.Income == 1].pivot_table(values='Income', index=['Native_country', 'Education'], aggfunc='count')
Income
Native_country Education
0 Assoc-acdm 3
Assoc-voc 4
Bachelors 52
Doctorate 15
HS-grad 13
Masters 23
Primary 10
Prof-school 9
Some-college 17
Cambodia Bachelors 2
HS-grad 2
Primary 1
Some-college 2
Canada Assoc-acdm 1
Assoc-voc 3
Bachelors 9
Doctorate 4
HS-grad 8
Masters 3
Primary 1
Prof-school 1
Some-college 9
China Bachelors 8
Doctorate 5
HS-grad 3
Masters 4
Columbia Doctorate 1
Prof-school 1
Cuba Bachelors 4
Doctorate 1
... ... ...
South Prof-school 1
Some-college 3
Taiwan Bachelors 3
Doctorate 7
HS-grad 1
Masters 6
Prof-school 3
Thailand Assoc-acdm 1
Doctorate 1
Some-college 1
Trinadad&Tobago HS-grad 1
Primary 1
United-States Assoc-acdm 247
Assoc-voc 336
Bachelors 2016
Doctorate 249
HS-grad 1583
Masters 866
Primary 202
Prof-school 374
Some-college 1298
Vietnam Bachelors 1
Doctorate 1
HS-grad 1
Primary 2
Yugoslavia Assoc-acdm 1
Bachelors 2
HS-grad 1
Primary 1
Some-college 1

191 rows × 1 columns

gen_in_df = train_df.where(train_df.Income == 1).pivot_table(values=['Income'], 
                                                             index='Education',
                                                             columns='Workclass', 
                                                             aggfunc='count')

#collapse-hide
gen_in_df.sort_index()
Income
Workclass ? Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc State-gov
Education
Assoc-acdm 6 19 28 170 18 18 6
Assoc-voc 13 15 25 256 19 21 12
Bachelors 45 95 162 1495 171 163 90
Doctorate 11 15 17 132 29 31 71
HS-grad 46 73 90 1119 119 179 49
Masters 18 47 173 534 57 59 71
Primary 9 2 10 163 15 40 5
Prof-school 8 23 19 171 78 106 18
Some-college 35 82 93 923 116 107 31
plt.figure(figsize=(16, 8))
sns.heatmap(gen_in_df.sort_index(), annot=True, fmt='.1f', cbar_kws= {'label':'Income range in categories'}, cmap='coolwarm')
plt.title('Incomes of various educated categories in Income wise')
Text(0.5, 1, 'Incomes of various educated categories in Income wise')

Among people earning >50K, Bachelors in the Private workclass form the largest group, with a count of 1495.

gen_sex_df = train_df.where(train_df.Income == 1).pivot_table(values=['Sex'], 
                                                             index='Education',
                                                             columns='Workclass', 
                                                             aggfunc='count')
gen_sex_df
Sex
Workclass ? Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc State-gov
Education
Assoc-acdm 6 19 28 170 18 18 6
Assoc-voc 13 15 25 256 19 21 12
Bachelors 45 95 162 1495 171 163 90
Doctorate 11 15 17 132 29 31 71
HS-grad 46 73 90 1119 119 179 49
Masters 18 47 173 534 57 59 71
Primary 9 2 10 163 15 40 5
Prof-school 8 23 19 171 78 106 18
Some-college 35 82 93 923 116 107 31

#collapse-hide

plt.figure(figsize=(16, 8))
sns.heatmap(gen_sex_df, annot=True, fmt='.1f', cbar_kws= {'label':'Income range in categories'}, cmap='coolwarm')
plt.title('Incomes of various educated categories in Gender wise')
Text(0.5, 1, 'Incomes of various educated categories in Gender wise')

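Counting the Sex column for the same >50K rows yields the same table as the Income counts, so Bachelors in the Private workclass are again the largest group (1495); this pivot does not separate Male from Female.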
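
To actually split the high-income counts by gender, Sex would need to appear as a column (or index) key rather than as the counted value. The sketch below shows one way to do that with `pd.crosstab` on invented rows; it is an illustration of the idea, not a result from the real dataset.

```python
import pandas as pd

# Invented rows; labels keep the leading space used in the raw data.
toy = pd.DataFrame({
    "Education": [" Bachelors", " Bachelors", " Bachelors", " HS-grad", " HS-grad"],
    "Sex":       [" Male", " Female", " Male", " Male", " Female"],
    "Income":    [1, 1, 1, 1, 0],
})

# Keep only the high-income rows, then count Education by Sex.
high = toy[toy["Income"] == 1]
by_gender = pd.crosstab(high["Education"], high["Sex"])
print(by_gender)
```

Adding Workclass as a second row key, for example `pd.crosstab([high["Education"], high["Workclass"]], high["Sex"])` on data that has that column, would give the three-way breakdown the question asks for.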

train_df.Sex.value_counts()
 Male      21790
 Female    10771
Name: Sex, dtype: int64
gen_in_df.index.names
FrozenList(['Education'])
gen_sex_df.loc[:,'Sex']
Workclass ? Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc State-gov
Education
Assoc-acdm 6 19 28 170 18 18 6
Assoc-voc 13 15 25 256 19 21 12
Bachelors 45 95 162 1495 171 163 90
Doctorate 11 15 17 132 29 31 71
HS-grad 46 73 90 1119 119 179 49
Masters 18 47 173 534 57 59 71
Primary 9 2 10 163 15 40 5
Prof-school 8 23 19 171 78 106 18
Some-college 35 82 93 923 116 107 31

[Report] Multivariate Analysis

  • Counts for each combination of workclass and education level, on an income basis
    • Among people earning >50K, Bachelors in the Private workclass form the largest group (1495)
  • Counts for each combination of workclass and education level, on a gender basis
    • Bachelors in the Private workclass are also the largest group when counting the Sex column (1495)