Before Applying Log Transformation
Uncovering the fundamentals of logarithms, with an implementation on random and toy data
Understanding and Applying Log Transformation
- Log transformation is used when values vary drastically in scale
The log scale shows relative (multiplicative) changes, while the linear scale shows absolute (additive) changes. When do you use each? When you care about relative changes, use the log scale; when you care about absolute changes, use the linear scale. This is true for distributions, but also for any quantity or changes in quantities.
Note, I use the word "care" here very specifically and intentionally. Without a model or a goal, your question cannot be answered; the model or goal defines which scale is important. If you're trying to model something, and the mechanism acts via a relative change, log-scale is critical to capturing the behavior seen in your data. But if the underlying model's mechanism is additive, you'll want to use linear-scale.
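As a quick sketch of this distinction (the numbers here are made up for illustration), consider a quantity that doubles at every step: the absolute differences keep growing, while the differences on the log scale are a constant log(2).

```python
import math

# Hypothetical series that doubles at every step.
values = [10, 20, 40, 80]

# Absolute (additive) changes grow with the series...
absolute_changes = [b - a for a, b in zip(values, values[1:])]

# ...while on the log scale each doubling is the same
# constant step of log(2).
log_changes = [math.log(b) - math.log(a) for a, b in zip(values, values[1:])]

print(absolute_changes)
print(log_changes)
```

On the linear scale the process looks like it is accelerating; on the log scale it is a straight line, which is exactly what a multiplicative mechanism predicts.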
The log function maps the small range of numbers between (0, 1) to the entire range of negative numbers (–∞, 0). The function log10(x) maps the range [1, 10] to [0, 1], [10, 100] to [1, 2], and so on. In other words, the log function compresses the range of large numbers and expands the range of small numbers. The larger x is, the more slowly log(x) grows.
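This mapping is easy to verify directly (a toy check, separate from the dataset used later):

```python
import math

# log10 sends each decade [10^k, 10^(k+1)] onto the unit
# interval [k, k+1]: large numbers are compressed, while the
# small range (0, 1) is stretched over all the negatives.
for x in [0.01, 0.1, 1, 10, 100, 1000]:
    print("log10({}) = {}".format(x, math.log10(x)))
```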
- ref: Feature Engineering for Machine Learning (O'Reilly): https://learning.oreilly.com/library/view/feature-engineering-for/9781491953235
import random
import numpy as np
import matplotlib.pyplot as plt
from math import log
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
exp = []
for i in range(1, 20):
    exp.append(random.randint(1, 20))
srt_exp = sorted(exp)
exp_10 = [10**num for num in srt_exp]  # exponentiate so the values vary drastically
exp_10
plt.plot(exp_10)
log_val = [log(x) for x in exp_10]
log_val_srt = sorted(log_val)
plt.plot(log_val_srt)
val_0_1 = []
for i in range(5):
    # sample strictly inside (0, 1); log(0) is undefined
    val_0_1.append(random.uniform(1e-3, 1))
lg_val_0_1 = [log(x) for x in val_0_1]  # all logs are negative on (0, 1)
plt.plot(sorted(lg_val_0_1))
rnd = []
for i in range(1, 50):
    rnd.append(random.randint(1, 50))
rnd = sorted(rnd)
rnd
lg_rnd = [log(x) for x in rnd]
plt.plot(lg_rnd)
df = pd.read_csv('OnlineNewsPopularity.csv', delimiter=', ', engine='python')  # python engine handles the multi-character delimiter (column names carry leading spaces)
df.head()
df.n_tokens_content.hist()
df.n_tokens_content.plot(kind='kde')
- Take the log transform of the 'n_tokens_content' feature, which
- represents the number of words (tokens) in a news article.
- Note that we add 1 to the raw count to prevent the logarithm from
- exploding into negative infinity in case the count is zero.
df['log_n_tokens_content'] = np.log10(df['n_tokens_content'] + 1)
df.log_n_tokens_content.hist()
df.log_n_tokens_content.plot(kind='kde')
fig, (ax1, ax2) = plt.subplots(2,1)
df['n_tokens_content'].hist(ax=ax1, bins=100)
ax1.tick_params(labelsize=14)
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax1.set_ylabel('Number of Articles', fontsize=14)
df['log_n_tokens_content'].hist(ax=ax2, bins=100)
ax2.tick_params(labelsize=14)
ax2.set_xlabel('Log of Number of Words', fontsize=14)
ax2.set_ylabel('Number of Articles', fontsize=14)
- Train two linear regression models to predict the number of shares
- of an article, one using the original feature and the other the
- log transformed version.
m_orig = linear_model.LinearRegression()
scores_orig = cross_val_score(m_orig, df[['n_tokens_content']],df['shares'], cv=10)
m_log = linear_model.LinearRegression()
scores_log = cross_val_score(m_log, df[['log_n_tokens_content']], df['shares'], cv=10)
print("R-squared score without log transform: %0.5f (+/- %0.5f)" % (scores_orig.mean(), scores_orig.std() * 2))
print("R-squared score with log transform: %0.5f (+/- %0.5f)" % (scores_log.mean(), scores_log.std() * 2))
fig2, (ax1, ax2) = plt.subplots(2,1)
ax1.scatter(df['n_tokens_content'], df['shares'])
ax1.tick_params(labelsize=14)
ax1.set_xlabel('Number of Words in Article', fontsize=14)
ax1.set_ylabel('Number of Shares', fontsize=14)
ax2.scatter(df['log_n_tokens_content'], df['shares'])
ax2.tick_params(labelsize=14)
ax2.set_xlabel('Log of the Number of Words in Article', fontsize=14)
ax2.set_ylabel('Number of Shares', fontsize=14)
Scatter plots of number of words (input) versus number of shares (target) in the Online News Popularity dataset: the top plot visualizes the original feature, and the bottom plot shows the scatter plot after log transformation.
for i in range(10, -11, -1):
    print("2^{} = ".format(i), 2**i, " ", "log({}) = ".format(2**i), log(2**i, 2))
print("2^-∞ = 0", "log(0) = -∞")
2^x is always positive for any real x; exponentiation can never produce a negative value. Since no power of 2 gives a negative number, the log of a negative value is undefined.
From the calculation above, 2^0 = 1, and as the exponent decreases the value keeps shrinking but never reaches 0. It is like cutting a cake into two, three, four, five pieces and so on: each piece gets smaller, yet there is always something left to cut. So we say that in the limit, 2^(-∞) = 0.
Similarly, for the log values, log(0) = -∞ (as a limit) and log(1) = 0; for x in (0, 1), log(x) takes values in (-∞, 0).
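These limits can be checked with Python's math.log (note that Python raises an error for log(0) rather than returning -∞):

```python
from math import log

# log base 2 of values in (0, 1) is negative...
print(log(0.25, 2))   # about -2
# ...log(1) = 0 in any base...
print(log(1, 2))
# ...and log(0) is undefined: Python raises ValueError.
try:
    log(0, 2)
except ValueError:
    print("log(0) is undefined")
```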