Before Applying Log Transformation
Uncovering the fundamentals of logarithms, with an implementation on random and toy data
Understanding and Applying Log Transformation
- Log transformation is used when values vary drastically in scale
The log scale shows relative (multiplicative) changes, while the linear scale shows absolute (additive) changes. When do you use each? When you care about relative changes, use the log scale; when you care about absolute changes, use the linear scale. This is true for distributions, but also for any quantity or changes in quantities.
Note, I use the word "care" here very specifically and intentionally. Without a model or a goal, your question cannot be answered; the model or goal defines which scale is important. If you're trying to model something, and the mechanism acts via a relative change, log-scale is critical to capturing the behavior seen in your data. But if the underlying model's mechanism is additive, you'll want to use linear-scale.
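As a quick sketch of this distinction (the numbers here are made up for illustration), consider a quantity that doubles at every step: the absolute differences keep growing, while the differences on the log scale are a constant log(2).

```python
import math

# Hypothetical series that doubles at every step.
values = [10, 20, 40, 80]

# Absolute (additive) changes grow with the series...
absolute_changes = [b - a for a, b in zip(values, values[1:])]

# ...while on the log scale each doubling is the same
# constant step of log(2).
log_changes = [math.log(b) - math.log(a) for a, b in zip(values, values[1:])]

print(absolute_changes)
print(log_changes)
```

On the linear scale the process looks like it is accelerating; on the log scale it is a straight line, which is exactly what a multiplicative mechanism predicts.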
The log function maps the small range of numbers between (0, 1) to the entire range of negative numbers (–∞, 0). The function log10(x) maps the range [1, 10] to [0, 1], [10, 100] to [1, 2], and so on. In other words, the log function compresses the range of large numbers and expands the range of small numbers. The larger x is, the more slowly log(x) grows.
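This mapping is easy to verify directly (a toy check, separate from the dataset used later):

```python
import math

# log10 sends each decade [10^k, 10^(k+1)] onto the unit
# interval [k, k+1]: large numbers are compressed, while the
# small range (0, 1) is stretched over all the negatives.
for x in [0.01, 0.1, 1, 10, 100, 1000]:
    print("log10({}) = {}".format(x, math.log10(x)))
```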
- ref: Feature Engineering for Machine Learning (O'Reilly): https://learning.oreilly.com/library/view/feature-engineering-for/9781491953235
import random
import numpy as np
import matplotlib.pyplot as plt
from math import log
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
exp = []
for i in range(1, 20):
    exp.append(random.randint(1, 20))
srt_exp = sorted(exp)
exp_10 = [10**num for num in srt_exp]  # exponentiate so the values vary drastically
exp_10
plt.plot(exp_10)
log_val = [log(x) for x in exp_10]
log_val_srt = sorted(log_val)
plt.plot(log_val_srt)
val_0_1 = []
for i in range(5):
    # sample strictly inside (0, 1); log(0) is undefined
    val_0_1.append(random.uniform(1e-3, 1))
lg_val_0_1 = [log(x) for x in val_0_1]  # all logs are negative on (0, 1)
plt.plot(sorted(lg_val_0_1))
rnd = []
for i in range(1, 50):
    rnd.append(random.randint(1, 50))
rnd = sorted(rnd)
rnd
lg_rnd = [log(x) for x in rnd]
plt.plot(lg_rnd)
df = pd.read_csv('OnlineNewsPopularity.csv', delimiter=', ', engine='python')  # python engine handles the multi-character delimiter (column names carry leading spaces)
df.head()
df.n_tokens_content.hist()
df.n_tokens_content.plot(kind='kde')
- Take the log transform of the 'n_tokens_content' feature, which
- represents the number of words (tokens) in a news article.
- Note that we add 1 to the raw count to prevent the logarithm from
- exploding into negative infinity in case the count is zero.
df['log_n_tokens_content'] = np.log10(df['n_tokens_content'] + 1)
df.log_n_tokens_content.hist()
df.log_n_tokens_content.plot(kind='kde')
fig, (ax1, ax2) = plt.subplots(2,1)
df['n_tokens_content'].hist(ax=ax1, bins=100)
ax1.tick_params(labelsize=14)
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax1.set_ylabel('Number of Articles', fontsize=14)
df['log_n_tokens_content'].hist(ax=ax2, bins=100)
ax2.tick_params(labelsize=14)
ax2.set_xlabel('Log of Number of Words', fontsize=14)
ax2.set_ylabel('Number of Articles', fontsize=14)
- Train two linear regression models to predict the number of shares
- of an article, one using the original feature and the other the
- log transformed version.
m_orig = linear_model.LinearRegression()
scores_orig = cross_val_score(m_orig, df[['n_tokens_content']],df['shares'], cv=10)
m_log = linear_model.LinearRegression()
scores_log = cross_val_score(m_log, df[['log_n_tokens_content']], df['shares'], cv=10)
print("R-squared score without log transform: %0.5f (+/- %0.5f)" % (scores_orig.mean(), scores_orig.std() * 2))
print("R-squared score with log transform: %0.5f (+/- %0.5f)" % (scores_log.mean(), scores_log.std() * 2))
fig2, (ax1, ax2) = plt.subplots(2,1)
ax1.scatter(df['n_tokens_content'], df['shares'])
ax1.tick_params(labelsize=14)
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax1.set_ylabel('Number of Shares', fontsize=14)
ax2.scatter(df['log_n_tokens_content'], df['shares'])
ax2.tick_params(labelsize=14)
ax2.set_xlabel('Log of the Number of Words in Article', fontsize=14)
ax2.set_ylabel('Number of Shares', fontsize=14)
Scatter plots of number of words (input) versus number of shares (target) in the Online News Popularity dataset: the top plot visualizes the original feature, and the bottom plot shows the scatter plot after log transformation.
for i in range(10, -11, -1):
    print("2^{} = ".format(i), 2**i, " ", "log({}) = ".format(2**i), log(2**i, 2))
print("2^-∞ = 0", "log(0) = -∞")
2^x is always positive for any real x; exponentiation can never produce a negative value. Since no power of 2 gives a negative number, the log of a negative value is undefined.
From the calculation above, 2^0 = 1, and as the exponent decreases the value keeps shrinking but never reaches 0. It is like cutting a cake into two, three, four, five pieces and so on: each piece gets smaller, yet there is always something left to cut. So we say that in the limit, 2^(-∞) = 0.
Similarly, for the log values, log(0) = -∞ (as a limit) and log(1) = 0; for x in (0, 1), log(x) takes values in (-∞, 0).
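These limits can be checked with Python's math.log (note that Python raises an error for log(0) rather than returning -∞):

```python
from math import log

# log base 2 of values in (0, 1) is negative...
print(log(0.25, 2))   # about -2
# ...log(1) = 0 in any base...
print(log(1, 2))
# ...and log(0) is undefined: Python raises ValueError.
try:
    log(0, 2)
except ValueError:
    print("log(0) is undefined")
```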