US Adult Income
Exploratory data analysis on the US salary dataset.
Dataset of adult income
</br>
Dataset Overview
- Each row is labelled with a salary of either ">50K" or "<=50K".
- The dataset is split into two CSV files, named adult-training.csv and adult-test.csv.
</br> The goal is to build a binary classifier on the training dataset that predicts the column income_bracket, which takes two possible values, ">50K" and "<=50K", and to evaluate the classifier's accuracy on the test dataset.
- categorical_columns = [workclass, education, marital_status, occupation, relationship, race, gender, native_country]
- continuous_columns = [age, education_num, capital_gain, capital_loss, hours_per_week]
- A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
The prediction task is to determine whether a person makes over 50K a year.
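The extraction rule quoted above can be read as a row filter. The following is a minimal sketch of that reading in pandas, assuming a frame whose columns carry the same names as the variables in the rule; the toy values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy records; column names mirror the variables in the extraction rule,
# and every value here is made up for illustration only.
records = pd.DataFrame({
    "AAGE":    [15, 25, 40, 52, 30],
    "AGI":     [500, 50, 30000, 45000, 20000],
    "AFNLWGT": [1200, 3400, 5600, 1, 2500],
    "HRSWK":   [10, 40, 45, 38, 40],
})

# Element-wise AND of the four conditions, giving one boolean per row.
mask = (
    (records["AAGE"] > 16)
    & (records["AGI"] > 100)
    & (records["AFNLWGT"] > 1)
    & (records["HRSWK"] > 0)
)

# Keep only the rows that satisfy every condition.
clean = records[mask]
print(clean)
```

Each comparison needs its own parentheses because `&` binds more tightly than `>` in Python.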
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
train_file = "adult-training.csv"
columns = ['Age','Workclass','fnlgwt','Education','Education_num','Marital_Status',
'Occupation','Relationship','Race','Sex','Capital_Gain','Capital_Loss',
'Hours/Week','Native_country','Income']
#collapse-hide
train = pd.read_csv(train_file, names=columns)
train.head()
train.info()
train.shape
train.describe()
# Replacing '?' with nan
train.replace(' ?', np.nan, inplace=True)
train.isnull().sum()
train['Income'].value_counts()
sns.countplot(x='Income', data=train)
plt.title("Count of Income Category")
plt.show()
- 24.08% of data points are labeled >50K
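Class shares like the one above can be read directly from `value_counts(normalize=True)`. This is a small sketch on a toy target column; the four labels are invented, and only the leading space in each label follows the raw CSV.

```python
import pandas as pd

# Toy target column; labels keep the leading space found in the raw CSV.
income = pd.Series([" <=50K"] * 3 + [" >50K"], name="Income")

# normalize=True returns fractions instead of raw counts;
# multiplying by 100 turns them into percentages.
shares = income.value_counts(normalize=True) * 100
print(shares.round(2))
```

On the real data the same call would be applied to `train['Income']`.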
# Gender distribution
sns.countplot(x='Sex', data=train)
plt.title("Count of Sex Category")
plt.show()
sns.stripplot(x='Sex', y='Hours/Week', data=train,hue='Income',marker='X')
# Workclass
wclass_plot = sns.countplot(x='Workclass', data=train)
wclass_plot.set_xticklabels( wclass_plot.get_xticklabels(),rotation=50, ha="right")
plt.title("Count Plot of Workclass")
The Private workclass has the highest count overall.
train['Education'].value_counts()
# Occupation
occ_plot = sns.countplot(x='Occupation', data=train)
occ_plot.set_xticklabels(occ_plot.get_xticklabels(), rotation=40, ha="right")
plt.title("Count Plot of Occupation")
# 4x2 grid of count plots; the unused last cell is removed
fig, axs = plt.subplots(ncols=2, nrows=4, figsize=(20, 20))
plt.subplots_adjust(hspace=0.68)
fig.delaxes(axs[3][1])
fig.suptitle('Subplot of Various Categorical Variables')
# Workclass
wc_plot = sns.countplot(x='Workclass', data=train, ax=axs[0][0])
wc_plot.set_xticklabels(wc_plot.get_xticklabels(), rotation=40, ha="right")
axs[0][0].title.set_text('Count Plot of Workclass')
# Native country
nc_plot = sns.countplot(x='Native_country', data=train, ax=axs[0][1])
nc_plot.set_xticklabels(nc_plot.get_xticklabels(), rotation=72, ha="right")
axs[0][1].title.set_text('Count Plot of Native_country')
# Education
ed_plot = sns.countplot(x='Education', data=train, ax=axs[1][0])
ed_plot.set_xticklabels(ed_plot.get_xticklabels(), rotation=40, ha="right")
axs[1][0].title.set_text('Count Plot of Education')
# Marital status
ms_plot = sns.countplot(x='Marital_Status', data=train, ax=axs[1][1])
ms_plot.set_xticklabels(ms_plot.get_xticklabels(), rotation=40, ha="right")
axs[1][1].title.set_text('Count Plot of Marital Status')
# Relationship
rel_plot = sns.countplot(x='Relationship', data=train, ax=axs[2][0])
rel_plot.set_xticklabels(rel_plot.get_xticklabels(), rotation=40, ha="right")
axs[2][0].title.set_text('Count Plot of Relationship')
# Race
race_plot = sns.countplot(x='Race', data=train, ax=axs[2][1])
race_plot.set_xticklabels(race_plot.get_xticklabels(), rotation=40, ha="right")
axs[2][1].title.set_text('Count Plot of Race')
# Occupation
occ_plot = sns.countplot(x='Occupation', data=train, ax=axs[3][0])
occ_plot.set_xticklabels(occ_plot.get_xticklabels(), rotation=40, ha="right")
axs[3][0].title.set_text('Count Plot of Occupation')
Majority count in each column:
- Workclass:
  - Private : 22696
- Native_country:
  - United-States : 29170
- Education:
  - HS-grad : 10501
- Marital_Status:
  - Married-civ-spouse : 14976
- Relationship:
  - Husband : 13193
- Race:
  - White : 27816
- Occupation:
  - Prof-specialty : 4140

Minority count in each column:
- Workclass:
  - Never-worked : 7
- Native_country:
  - Holand-Netherlands : 1
- Education:
  - Preschool : 51
- Marital_Status:
  - Married-AF-spouse : 23
- Relationship:
  - Other-relative : 981
- Race:
  - Other : 271
- Occupation:
  - Armed-Forces : 9
plt.figure(figsize=(20, 6))
sns.countplot(x='Marital_Status', hue='Income', data=train)
plt.title("Count Plot of Marital Status with Hue Income")
Most never-married people have an income of <=50K.
plt.figure(figsize=(20, 6))
sns.countplot(x='Relationship', hue='Income', data=train)
plt.title("Count Plot of Relationship with Hue Income")
plt.figure(figsize=(20, 6))
sns.countplot(x='Age', hue='Income', data=train)
plt.title("Count Plot of Age with Hue Income")
sns.set_style("whitegrid")
sns.pairplot(train, hue="Income", height=3)
plt.show()
#collapse-hide
# Age with Income
sns.FacetGrid(train, hue="Income", height=6) \
.map(sns.distplot, "Age") \
.add_legend();
plt.show();
# Education_num with Income
sns.FacetGrid(train, hue="Income", height=6) \
.map(sns.distplot, "Education_num") \
.add_legend();
plt.show();
# Capital_Gain with Income
sns.FacetGrid(train, hue="Income", height=7) \
.map(sns.distplot, "Capital_Gain") \
.add_legend();
plt.show();
# Capital_Loss with Income
sns.FacetGrid(train, hue="Income", height=7) \
.map(sns.distplot, "Capital_Loss") \
.add_legend();
plt.show();
# Hours/Week with Income
sns.FacetGrid(train, hue="Income", height=7) \
.map(sns.distplot, "Hours/Week") \
.add_legend();
plt.show();
[Report] Univariate Analysis
The dataset is imbalanced, with <=50K as the majority class label.
- 75.91% of data points are labeled <=50K
- 24.08% of data points are labeled >50K
Missing Data:
Workclass (1836), Occupation (1843), Native_country (583)
</br>
All missing values belong to categorical columns.
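One way to obtain per-column missing counts like these is to let `read_csv` treat the `?` marker as missing at load time. The sketch below assumes a tiny invented CSV with a subset of the columns. Note that `skipinitialspace=True` also strips the leading space from every value, so comparisons elsewhere in this notebook that rely on it (such as `' >50K'`) would need adjusting if this loading style were adopted.

```python
import io
import pandas as pd

# Tiny invented sample in the same "comma + space" layout as the raw file.
raw = io.StringIO(
    "39, State-gov, Adm-clerical, United-States\n"
    "54, ?, ?, United-States\n"
    "32, Private, Sales, ?\n"
)
cols = ["Age", "Workclass", "Occupation", "Native_country"]

# skipinitialspace drops the space after each comma, so the marker is read
# as "?" and na_values converts it to NaN while parsing.
df = pd.read_csv(raw, names=cols, na_values="?", skipinitialspace=True)

# Number of missing entries per column.
missing = df.isnull().sum()
print(missing)
```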
- Workclass
  - Majority:
    - Private, 22696
  - Minority:
    - Never-worked, 7
    - Without-pay, 14
    - Federal-gov, 960
- Native Country
  - Majority:
    - United-States, 29170
  - Minority:
    - Holand-Netherlands, 1
    - Scotland, 12
  - Missing Data:
    - ?, 583
- Education
  - Majority:
    - HS-grad, 10501
    - Some-college, 7291
    - Bachelors, 5355
  - Minority:
    - Preschool, 51
    - 1st-4th, 168
    - 5th-6th, 333
- Marital Status
  - Majority:
    - Married-civ-spouse, 14976
    - Never-married, 10683
    - Divorced, 4443
  - Minority:
    - Married-AF-spouse, 23
    - Married-spouse-absent, 418
- Relationship
  - Majority:
    - Husband, 13193
    - Not-in-family, 8305
  - Minority:
    - Other-relative, 981
    - Wife, 1568
- Race
  - Majority:
    - White, 27816
    - Black, 3124
  - Minority:
    - Other, 271
    - Amer-Indian-Eskimo, 311
- Occupation
  - Majority:
    - Prof-specialty, 4140
    - Craft-repair, 4099
    - Exec-managerial, 4066
  - Minority:
    - Armed-Forces, 9
    - Priv-house-serv, 149
    - Protective-serv, 649
  - Missing Data:
    - ?, 1843
Majority count aggregation in each column:
- Workclass:
  - Private : 22696
- Native_country:
  - United-States : 29170
- Education:
  - HS-grad : 10501
- Marital_Status:
  - Married-civ-spouse : 14976
- Relationship:
  - Husband : 13193
- Race:
  - White : 27816
- Occupation:
  - Prof-specialty : 4140

Minority count aggregation in each column:
- Workclass:
  - Never-worked : 7
- Native_country:
  - Holand-Netherlands : 1
- Education:
  - Preschool : 51
- Marital_Status:
  - Married-AF-spouse : 23
- Relationship:
  - Other-relative : 981
- Race:
  - Other : 271
- Occupation:
  - Armed-Forces : 9
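A majority and minority summary like the one above can be derived from `value_counts()` rather than read off the plots. The following is a minimal sketch on an invented toy frame; the `summary` dictionary is a hypothetical helper structure, not something the notebook defines.

```python
import pandas as pd

# Invented toy data with two categorical columns.
toy = pd.DataFrame({
    "Workclass": ["Private", "Private", "Private",
                  "State-gov", "State-gov", "Never-worked"],
    "Race": ["White", "White", "White", "White", "Black", "Black"],
})

# For every column, record the most and least frequent category
# together with its count.
summary = {}
for col in toy.columns:
    counts = toy[col].value_counts()
    summary[col] = {
        "majority": (counts.idxmax(), int(counts.max())),
        "minority": (counts.idxmin(), int(counts.min())),
    }

print(summary)
```

If several categories tie for the highest or lowest count, `idxmax` and `idxmin` return only one of them, so ties may deserve a manual check.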
train_df = pd.read_csv("adult-training.csv", names=columns)
# Replacing '?' with nan
#train_df.replace(' ?', np.nan, inplace=True)
Questions:
- Which workclass earns the most?
- Which education level earns the most?
- Which marital status category earns the most?
- Which occupation category earns the most?
- Which relationship category earns the most?
- Which gender earns the most?
- Which race earns the most?
- Which native country earns the most?
train_df['Income'] = train['Income'].apply(lambda x: 1 if x==' >50K' else 0)
# train_df still holds the raw ' ?' marker (no NaN), so replace it directly
train_df['Workclass'] = train_df['Workclass'].replace(' ?', ' 0')
sns.catplot(x="Workclass", y="Income", data=train_df, kind="bar", height=6,
palette="muted")
plt.xticks(rotation=45);
plt.title("Bar plot of Work Class VS Income")
People in the Self-emp-inc workclass have the highest share of >50K earners.
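Because `Income` was encoded as 0/1, each bar in these plots is the mean of that column within a category, which equals the share of people in the category earning >50K. A small sketch with invented rows shows the same quantity computed with `groupby`:

```python
import pandas as pd

# Invented rows; Income is already encoded as 1 (>50K) or 0 (<=50K).
toy = pd.DataFrame({
    "Workclass": ["Private", "Private", "Private", "Private",
                  "Self-emp-inc", "Self-emp-inc"],
    "Income": [0, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones in each group.
share_high = (
    toy.groupby("Workclass")["Income"]
    .mean()
    .sort_values(ascending=False)
)
print(share_high)
```

So "earning the most" in this section means the largest proportion of >50K earners, not the highest salary.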
sns.catplot(x="Education", y="Income", data=train_df, kind="bar", height=7,
palette="muted")
plt.xticks(rotation=60);
plt.title("Bar plot of Education VS Income")
All the grade-level education categories can be combined into a single Primary category.
</br>
ref: https://www.kaggle.com/kost13/us-income-logistic-regression/comments
def primary(x):
    # Map every grade-level education value to a single ' Primary' label
    if x in [' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th', ' 12th']:
        return ' Primary'
    else:
        return x
train_df['Education'] = train_df['Education'].apply(primary)
sns.catplot(x="Education", y="Income", data=train_df, kind="bar", height=7,
palette="muted")
plt.xticks(rotation=60);
Combined [' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th', ' 12th'] into a single category, Primary.
- People with a Doctorate or Prof-school education have the highest share of >50K earners.
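The grouping into Primary can also be written without a row-wise `apply`. This is a sketch of one vectorized alternative using `isin` and `mask` on an invented Series; it is not the approach used in the notebook, only an equivalent formulation.

```python
import pandas as pd

# Grade-level categories to merge (leading spaces match the raw CSV values).
grades = [" 1st-4th", " 5th-6th", " 7th-8th", " 9th",
          " 10th", " 11th", " 12th"]

# Invented sample of education values.
education = pd.Series([" 9th", " Bachelors", " 1st-4th", " HS-grad"])

# mask() replaces entries where the condition is True and keeps the rest.
combined = education.mask(education.isin(grades), " Primary")
print(combined.tolist())
```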
sns.catplot(x="Education_num", y="Income", data=train_df, kind="bar", height=6,
palette="muted")
plt.xticks(rotation=60);
plt.title("Bar plot of Education_num VS Income")
The higher the Education_num, the larger the share of >50K earners.
sns.catplot(x="Marital_Status", y="Income", data=train_df, kind="bar", height=5,
palette="muted")
plt.xticks(rotation=60);
print(train_df['Marital_Status'].value_counts())
plt.title("Bar plot of Marital Status VS Income")
People in the Married-civ-spouse category have the highest share of >50K earners.
# Filling '?' in Occupation with ' 0'
train_df['Occupation'] = train_df['Occupation'].replace(' ?', ' 0')
train_df['Occupation'].value_counts()
sns.catplot(x="Occupation", y="Income", data=train_df, kind="bar", height=8,
palette="muted")
plt.xticks(rotation=60);
plt.title("Bar plot of Occupation VS Income")
People in the Exec-managerial occupation have the highest share of >50K earners.
sns.catplot(x="Relationship", y="Income", data=train_df, height=5, kind="bar",
palette="muted")
plt.xticks(rotation=60)
plt.title("Bar plot of Relationship VS Income")
print(train_df['Relationship'].value_counts())
People in the Wife relationship category have the highest share of >50K earners.
sns.catplot(x="Race", y="Income", data=train_df, height=5, kind="bar",
palette="muted")
plt.xticks(rotation=60)
plt.title("Bar plot of Race VS Income")
print(train_df['Race'].value_counts())
Among races, Asian-Pac-Islander has the highest share of >50K earners.
sns.catplot(x="Sex", y="Income", data=train_df, height=5, kind="bar",
palette="muted")
plt.xticks(rotation=60)
plt.title("Bar plot of Sex VS Income")
print(train_df['Sex'].value_counts())
Men have the higher share of >50K earners.
# Filling '?' in Native_country with ' 0'
train_df['Native_country'] = train_df['Native_country'].replace(' ?', ' 0')
#collapse-hide
sns.catplot(x="Native_country", y="Income", data=train_df, height=13, kind="bar",
palette="muted")
plt.xticks(rotation=80)
print(train_df['Native_country'].value_counts())
train_df.columns
train_df['Native_country'].value_counts()
People whose native country is Iran have the highest share of >50K earners.
colormap = plt.cm.magma
plt.figure(figsize=(16,16))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
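# Pearson correlation is only defined for numeric columns, so select those first
sns.heatmap(train_df.select_dtypes(include='number').corr(), linewidths=0.1, vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)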
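The correlation heatmap only makes sense for numeric columns, and recent pandas versions raise an error if `corr()` meets string columns. The sketch below, on an invented toy frame, shows one way to restrict the computation with `select_dtypes`; the column values are made up.

```python
import pandas as pd

# Invented toy frame mixing numeric and string columns.
toy = pd.DataFrame({
    "Age": [25, 38, 47, 52],
    "Hours/Week": [40, 45, 50, 60],
    "Sex": [" Male", " Female", " Male", " Female"],
    "Income": [0, 0, 1, 1],
})

# Keep only numeric columns before computing Pearson correlations.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))
```

The resulting square matrix can be passed to `sns.heatmap` unchanged.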
[Bivariate][Report] Answers from the bivariate analysis
- Which workclass earns the most?
  - Self-emp-inc
- Which education level earns the most?
  - Doctorate and Prof-school
- Which marital status category earns the most?
  - Married-civ-spouse
- Which occupation category earns the most?
  - Exec-managerial
- Which relationship category earns the most?
  - Wife
- Which gender earns the most?
  - Men
- Which race earns the most?
  - Asian-Pac-Islander
- Which native country earns the most?
  - Iran
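The answers above can be cross-checked in one pass by computing, for each categorical column, the category with the highest mean of the 0/1 `Income` column. The following is a minimal sketch on invented data; the `top` dictionary is a hypothetical helper. Reporting the group size next to each answer is a design choice, since a category with very few rows gives a less reliable share.

```python
import pandas as pd

# Invented rows; Income is encoded as 1 (>50K) or 0 (<=50K).
toy = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Race": ["White", "White", "White", "White", "Black", "Black"],
    "Income": [1, 1, 1, 0, 0, 1],
})

# For each column, find the category with the highest share of >50K
# earners and keep its row count alongside.
top = {}
for col in ["Sex", "Race"]:
    stats = toy.groupby(col)["Income"].agg(["mean", "count"])
    best = stats["mean"].idxmax()
    top[col] = (best, int(stats.loc[best, "count"]))

print(top)
```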
train_mult_index = train_df.set_index(keys = ['Income','Education','Native_country']).sort_index()
train_mult_index.tail()
train_mult_index.loc[(1, " Primary", " United-States"),].count().iloc[0]
train_mult_index.stack().to_frame()
#collapse-hide
train_df
iec_data = train_df.loc[:,("Income", "Education", "Workclass")]
iec_data
iec_data.pivot_table(values='Income', index='Education', aggfunc='count', margins_name='Income')
iec_data[iec_data.Income == 1].pivot_table(values='Income', index='Education', aggfunc='count', margins_name='Income')
#collapse-hide
iec_data_pivot = iec_data[iec_data.Income == 1].pivot_table(values='Income', index='Education', aggfunc='count', margins_name='Income')
plt.figure(figsize=(16, 8))
sns.heatmap(iec_data_pivot, annot=True, fmt='.1f', cbar_kws= {'label':'Count of >50K earners'}, cmap='coolwarm')
plt.title('Count of >50K earners by education')
#collapse-hide
train_df[train_df.Income == 1].pivot_table(values='Income', index=['Native_country', 'Education'], aggfunc='count')
gen_in_df = train_df.where(train_df.Income == 1).pivot_table(values=['Income'],
index='Education',
columns='Workclass',
aggfunc='count')
#collapse-hide
gen_in_df.sort_index()
plt.figure(figsize=(16, 8))
sns.heatmap(gen_in_df.sort_index(), annot=True, fmt='.1f', cbar_kws= {'label':'Count of >50K earners'}, cmap='coolwarm')
plt.title('Count of >50K earners by education and workclass')
Among people earning >50K, those with a Bachelors education in the Private workclass number 1495.
gen_sex_df = train_df.where(train_df.Income == 1).pivot_table(values=['Sex'],
index='Education',
columns='Workclass',
aggfunc='count')
gen_sex_df
#collapse-hide
plt.figure(figsize=(16, 8))
sns.heatmap(gen_sex_df, annot=True, fmt='.1f', cbar_kws= {'label':'Count of Sex entries among >50K earners'}, cmap='coolwarm')
plt.title('Count of Sex entries among >50K earners by education and workclass')
Counting by the Sex column, people with a Bachelors education in the Private workclass are the largest group among >50K earners.
train_df.Sex.value_counts()
gen_in_df.index.names
# The 'Sex' column level exists only in gen_sex_df, not in gen_in_df
gen_sex_df.loc[:, 'Sex']
[Report] Multivariate Analysis
- Counts for each workclass and education combination, on an income basis:
  - People with a Bachelors education in the Private workclass account for 1495 of those earning >50K.
- Counts for each workclass and education combination, on a gender basis:
  - People with a Bachelors education in the Private workclass form the largest group.
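The education-by-workclass counts of >50K earners can also be produced with `pd.crosstab`, which counts rows directly instead of counting non-null values of a chosen column. This is a sketch on invented rows, offered as an alternative formulation and not as the method used above.

```python
import pandas as pd

# Invented rows; Income is encoded as 1 (>50K) or 0 (<=50K).
toy = pd.DataFrame({
    "Education": [" Bachelors", " Bachelors", " Bachelors",
                  " HS-grad", " HS-grad"],
    "Workclass": [" Private", " Private", " State-gov",
                  " Private", " Private"],
    "Income": [1, 1, 1, 0, 1],
})

# Restrict to >50K earners, then count rows per (Education, Workclass) pair.
high = toy[toy["Income"] == 1]
counts = pd.crosstab(high["Education"], high["Workclass"])
print(counts)
```

Combinations that never occur appear as 0 in the crosstab, whereas the pivot tables above leave them as NaN.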