Understanding and Applying Log Transformation

  • Log transformation is used when values vary drastically (across several orders of magnitude)

The log scale shows relative (multiplicative) changes, while the linear scale shows absolute (additive) changes. When do you use each? When you care about relative changes, use the log scale; when you care about absolute changes, use the linear scale. This holds for distributions, but also for any quantity or change in a quantity.

Note, I use the word "care" here very specifically and intentionally. Without a model or a goal, your question cannot be answered; the model or goal defines which scale is important. If you're trying to model something, and the mechanism acts via a relative change, log-scale is critical to capturing the behavior seen in your data. But if the underlying model's mechanism is additive, you'll want to use linear-scale.
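A quick numeric illustration of the distinction (a minimal sketch with made-up numbers): equal ratios collapse to equal distances on the log scale, while the linear scale keeps the absolute gaps.

```python
from math import log10

# Two jumps: 10 -> 20 and 100 -> 200.
# On a linear scale the absolute changes differ (10 vs 100);
# on a log scale both are doublings, so the log differences are equal.
lin_a = 20 - 10                    # absolute change: 10
lin_b = 200 - 100                  # absolute change: 100

log_a = log10(20) - log10(10)      # log10(2)
log_b = log10(200) - log10(100)    # log10(2)

print(lin_a, lin_b)   # different gaps on the linear scale
print(log_a, log_b)   # identical gaps: log(b) - log(a) = log(b / a)
```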

import random
import numpy as np
import matplotlib.pyplot as plt
from math import log
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

Using Log Transformation to Convert Exponential Data to Linear

exp = []
for i in range(100):
    exp.append(random.randint(1, 100))
srt_exp = sorted(exp)
np.array(srt_exp)
array([ 1,  1,  2,  2,  4,  6,  6,  7,  7,  8,  8,  8,  8,  9, 10, 10, 13,
       14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 17, 17, 17, 18, 19, 19, 20,
       23, 23, 23, 23, 24, 24, 27, 28, 29, 31, 31, 32, 37, 38, 39, 39, 43,
       48, 49, 52, 54, 58, 59, 60, 60, 60, 60, 61, 62, 63, 63, 65, 66, 66,
       68, 68, 69, 70, 71, 77, 78, 78, 79, 80, 81, 87, 88, 89, 92, 93, 96,
       96, 97])
exp_10 = [(10**num) for num in srt_exp]
exp_10
[10,
 10,
 100,
 100,
 10000,
 10000,
 10000,
 10000,
 ...
 (output truncated; the values grow to roughly 10**97)]
plt.plot(exp_10)
log_val = [log(x) for x in exp_10]
log_val_srt = sorted(log_val)
plt.plot(log_val_srt)
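As a sanity check (with small stand-in values, since the trick is the same at any scale): because exp_10 was built as 10**num, taking the base-10 log recovers the original integers, which is exactly why the plot straightens into a line.

```python
import numpy as np

srt_exp = [1, 2, 3, 5, 8]                 # stand-in for the sorted random draws
exp_10 = [10 ** num for num in srt_exp]   # [10, 100, 1000, 100000, 100000000]
recovered = np.log10(exp_10)              # log10 inverts 10**x
print(recovered)                          # ≈ [1. 2. 3. 5. 8.]
```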

Log values in the interval (0, 1) are negative

val_0_1 = []
for i in range(5):
    val_0_1.append(random.uniform(0,2))
lg_val_0_1 = [log(x) for x in val_0_1]
plt.plot(sorted(lg_val_0_1))
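To see where the sign flips (a minimal sketch): log is negative below 1, zero at 1, and positive above 1 — which is why the uniform(0, 2) sample above produces a mix of negative and positive log values.

```python
from math import log

for x in [0.1, 0.5, 1.0, 2.0]:
    print(x, log(x))
# log(0.1) and log(0.5) are negative, log(1.0) == 0.0, log(2.0) is positive
```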

Plotting the Log Curve on Random Data

rnd = []
for i in range(1,50):
    rnd.append(random.randint(1,50))
rnd = sorted(rnd)
rnd
[2, 3, 5, 5, 6, 7, 8, 9, 9, 10, 13, 13, 17, 17, 18, 18, 19, 19, 21, 23,
 23, 25, 26, 28, 29, 29, 30, 31, 31, 33, 34, 34, 36, 37, 38, 38, 39, 39,
 39, 43, 44, 44, 46, 47, 47, 48, 49, 50, 50]
lg_rnd = [log(x) for x in rnd]
plt.plot(lg_rnd)
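The curve above climbs steeply and then flattens because equal steps in x add ever-smaller increments to log(x). A quick check of that concavity:

```python
from math import log

vals = [log(x) for x in range(1, 51)]
diffs = [b - a for a, b in zip(vals, vals[1:])]

# log(x + 1) - log(x) = log(1 + 1/x) shrinks as x grows,
# so every increment is smaller than the one before it
print(diffs[0], diffs[-1])   # big first step, tiny last step
```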

Using log transformed word counts in the Online News Popularity dataset to predict article popularity

df = pd.read_csv('OnlineNewsPopularity.csv', delimiter=', ', engine='python')
df.head()
url timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs ... min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity shares
0 http://mashable.com/2013/01/07/amazon-instant-... 731.0 12.0 219.0 0.663594 1.0 0.815385 4.0 2.0 1.0 ... 0.100000 0.7 -0.350000 -0.600 -0.200000 0.500000 -0.187500 0.000000 0.187500 593
1 http://mashable.com/2013/01/07/ap-samsung-spon... 731.0 9.0 255.0 0.604743 1.0 0.791946 3.0 1.0 1.0 ... 0.033333 0.7 -0.118750 -0.125 -0.100000 0.000000 0.000000 0.500000 0.000000 711
2 http://mashable.com/2013/01/07/apple-40-billio... 731.0 9.0 211.0 0.575130 1.0 0.663866 3.0 1.0 1.0 ... 0.100000 1.0 -0.466667 -0.800 -0.133333 0.000000 0.000000 0.500000 0.000000 1500
3 http://mashable.com/2013/01/07/astronaut-notre... 731.0 9.0 531.0 0.503788 1.0 0.665635 9.0 0.0 1.0 ... 0.136364 0.8 -0.369697 -0.600 -0.166667 0.000000 0.000000 0.500000 0.000000 1200
4 http://mashable.com/2013/01/07/att-u-verse-apps/ 731.0 13.0 1072.0 0.415646 1.0 0.540890 19.0 19.0 20.0 ... 0.033333 1.0 -0.220192 -0.500 -0.050000 0.454545 0.136364 0.045455 0.136364 505

5 rows × 61 columns

df.n_tokens_content.hist()
df.n_tokens_content.plot(kind='kde')
  • Take the log transform of the 'n_tokens_content' feature, which
  • represents the number of words (tokens) in a news article.
  • Note that we add 1 to the raw count to prevent the logarithm from
  • exploding into negative infinity in case the count is zero.
df['log_n_tokens_content'] = np.log10(df['n_tokens_content'] + 1)
df.log_n_tokens_content.hist()
df.log_n_tokens_content.plot(kind='kde')
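The +1 shift matters because an article could have zero words: log10(0) is undefined, but log10(0 + 1) = 0. A small sketch of the same trick on hypothetical counts (NumPy also provides np.log1p, the natural-log version of the shift, which stays accurate for tiny x):

```python
import numpy as np

counts = np.array([0, 9, 99, 999])   # hypothetical word counts, one of them zero
shifted = np.log10(counts + 1)       # the +1 shift keeps log10 well-defined
print(shifted)                       # [0. 1. 2. 3.]

# np.log1p computes log(1 + x) (natural log) without losing precision for small x
print(np.log1p(1e-10))
```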
fig, (ax1, ax2) = plt.subplots(2,1)
df['n_tokens_content'].hist(ax=ax1, bins=100)
ax1.tick_params(labelsize=14)
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax1.set_ylabel('Number of Articles', fontsize=14)
df['log_n_tokens_content'].hist(ax=ax2, bins=100)
ax2.tick_params(labelsize=14)
ax2.set_xlabel('Log of Number of Words', fontsize=14)
ax2.set_ylabel('Number of Articles', fontsize=14)
  • Train two linear regression models to predict the number of shares
  • of an article, one using the original feature and the other the
  • log transformed version.
m_orig = linear_model.LinearRegression()
scores_orig = cross_val_score(m_orig, df[['n_tokens_content']],df['shares'], cv=10)
m_log = linear_model.LinearRegression()
scores_log = cross_val_score(m_log, df[['log_n_tokens_content']], df['shares'], cv=10)
print("R-squared score without log transform: %0.5f (+/- %0.5f)" % (scores_orig.mean(), scores_orig.std() * 2))
print("R-squared score with log transform: %0.5f (+/- %0.5f)" % (scores_log.mean(), scores_log.std() * 2))
R-squared score without log transform: -0.00242 (+/- 0.00509)
R-squared score with log transform: -0.00114 (+/- 0.00418)
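Both cross-validated scores are slightly negative, which sounds odd for "R-squared" but simply means the fitted line predicts held-out shares worse than always predicting the mean would. A minimal sketch of that behavior on made-up numbers (the formula below is the standard definition scikit-learn's scorer uses):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical targets

# predicting the mean of y scores exactly 0
print(r_squared(y, np.full_like(y, y.mean())))   # 0.0

# a prediction worse than the mean scores below 0
print(r_squared(y, y[::-1]))                     # -3.0
```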
fig2, (ax1, ax2) = plt.subplots(2,1)
ax1.scatter(df['n_tokens_content'], df['shares'])
ax1.tick_params(labelsize=14)
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax1.set_ylabel('Number of Shares', fontsize=14)
ax2.scatter(df['log_n_tokens_content'], df['shares'])
ax2.tick_params(labelsize=14)
ax2.set_xlabel('Log of the Number of Words in Article', fontsize=14)
ax2.set_ylabel('Number of Shares', fontsize=14)

Scatter plots of number of words (input) versus number of shares (target) in the Online News Popularity dataset—the top plot visualizes the original feature, and the bottom plot shows the scatter plot after log transformation

Why is log(x) undefined for negative x?

for i in range(10,-11,-1):
    print("2^{} = ".format(i), 2**i," ", "log({}) = ".format(2**i), log(2**i,2) )
print("2^-∞ = 0", "log(0) = -∞")
2^10 =  1024   log(1024) =  10.0
2^9 =  512   log(512) =  9.0
2^8 =  256   log(256) =  8.0
2^7 =  128   log(128) =  7.0
2^6 =  64   log(64) =  6.0
2^5 =  32   log(32) =  5.0
2^4 =  16   log(16) =  4.0
2^3 =  8   log(8) =  3.0
2^2 =  4   log(4) =  2.0
2^1 =  2   log(2) =  1.0
2^0 =  1   log(1) =  0.0
2^-1 =  0.5   log(0.5) =  -1.0
2^-2 =  0.25   log(0.25) =  -2.0
2^-3 =  0.125   log(0.125) =  -3.0
2^-4 =  0.0625   log(0.0625) =  -4.0
2^-5 =  0.03125   log(0.03125) =  -5.0
2^-6 =  0.015625   log(0.015625) =  -6.0
2^-7 =  0.0078125   log(0.0078125) =  -7.0
2^-8 =  0.00390625   log(0.00390625) =  -8.0
2^-9 =  0.001953125   log(0.001953125) =  -9.0
2^-10 =  0.0009765625   log(0.0009765625) =  -10.0
2^-∞ = 0 log(0) = -∞

The value of 2^x is always positive: no exponent can ever produce a negative result. Since there is no x with 2^x equal to a negative number, the logarithm of a negative number is undefined.
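This is also visible directly in Python — math.log raises a ValueError ("math domain error") for zero and negative inputs:

```python
from math import log

for bad in [0, -1]:
    try:
        log(bad)
    except ValueError as err:
        print(bad, "->", err)   # math domain error for both inputs
```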

Why are log values between (0, 1) negative?

From the table above, 2^0 = 1, and as the exponent decreases the value keeps halving: 1/2, 1/4, 1/8, and so on. Like cutting a cake into ever-smaller pieces, there is always something left to halve, so the value never quite reaches 0; only in the limit, as the exponent goes to -∞, does 2^x reach 0.

Correspondingly, log(1) = 0 and log(0) = -∞, so for x in the interval (0, 1) the log values sweep through (0, -∞): they are all negative.
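A quick numeric check of that mapping: as x slides from 1 down toward 0, log(x) drops from 0 toward -∞.

```python
from math import log

print(log(1))            # 0.0
for x in [0.5, 0.1, 0.01, 1e-6]:
    print(x, log(x))     # increasingly large negative values
```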