brendan_oconnor_ai brendan_oconnor_ai-2013 brendan_oconnor_ai-2013-201 knowledge-graph by maker-knowledge-mining

201 brendan oconnor ai-2013-10-31-tanh is a rescaled logistic sigmoid function


meta info for this blog

Source: html

Introduction: This confused me for a while when I first learned it, so in case it helps anyone else: The logistic sigmoid function, a.k.a. the inverse logit function, is \[ g(x) = \frac{ e^x }{1 + e^x} \] Its outputs range from 0 to 1, and are often interpreted as probabilities (in, say, logistic regression). The tanh function, a.k.a. hyperbolic tangent function, is a rescaling of the logistic sigmoid, such that its outputs range from -1 to 1. (There’s horizontal stretching as well.) \[ \tanh(x) = 2 g(2x) - 1 \] It’s easy to show the above leads to the standard definition \( \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \). The (-1,+1) output range tends to be more convenient for neural networks, so tanh functions show up there a lot. The two functions are plotted below. Blue is the logistic function, and red is tanh.
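Spelling out the “easy to show” step, substituting g into the rescaling: \[ 2 g(2x) - 1 = \frac{2 e^{2x}}{1 + e^{2x}} - 1 = \frac{e^{2x} - 1}{e^{2x} + 1} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \tanh(x), \] where the last step multiplies numerator and denominator by \( e^{-x} \).

A minimal sketch in base R (not from the original post) that checks the identity numerically and reproduces the plot described above, blue for the logistic and red for tanh:

g <- function(x) exp(x) / (1 + exp(x))    # logistic sigmoid, a.k.a. inverse logit
x <- seq(-5, 5, length.out = 1000)
max(abs(tanh(x) - (2 * g(2 * x) - 1)))    # ~0, up to floating-point error
plot(x, g(x), type = "l", col = "blue", ylim = c(-1, 1), ylab = "")
lines(x, tanh(x), col = "red")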


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This confused me for a while when I first learned it, so in case it helps anyone else: The logistic sigmoid function, a. [sent-1, score-0.984]

2 the inverse logit function, is \[ g(x) = \frac{ e^x }{1 + e^x} \] Its outputs range from 0 to 1, and are often interpreted as probabilities (in, say, logistic regression). [sent-4, score-1.214]

3 hyperbolic tangent function, is a rescaling of the logistic sigmoid, such that its outputs range from -1 to 1. [sent-8, score-0.763]

4 ) \[ tanh(x) = 2 g(2x) - 1 \] It’s easy to show the above leads to the standard definition \( tanh(x) = \frac{e^x – e^{-x}}{e^x + e^{-x}} \). [sent-10, score-0.365]

5 The (-1,+1) output range tends to be more convenient for neural networks, so tanh functions show up there a lot. [sent-11, score-1.481]


similar blogs computed by tfidf model

tfidf for this blog:

wordName wordTfidf (topN-words)

[('tanh', 0.609), ('logistic', 0.36), ('function', 0.357), ('range', 0.255), ('sigmoid', 0.243), ('functions', 0.17), ('outputs', 0.148), ('frac', 0.148), ('show', 0.114), ('horizontal', 0.106), ('logit', 0.106), ('inverse', 0.106), ('helps', 0.097), ('interpreted', 0.097), ('convenient', 0.097), ('neural', 0.097), ('probabilities', 0.097), ('leads', 0.09), ('confused', 0.085), ('networks', 0.081), ('red', 0.081), ('blue', 0.081), ('plotted', 0.081), ('definition', 0.074), ('tends', 0.074), ('regression', 0.065), ('output', 0.065), ('learned', 0.061), ('anyone', 0.058), ('else', 0.057), ('often', 0.045), ('case', 0.045), ('easy', 0.044), ('standard', 0.043), ('say', 0.036), ('two', 0.036), ('first', 0.035)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 1.0 201 brendan oconnor ai-2013-10-31-tanh is a rescaled logistic sigmoid function


2 0.12303387 169 brendan oconnor ai-2011-05-20-Log-normal and logistic-normal terminology

Introduction: I was cleaning my office and found a back-of-envelope diagram Shay drew me once, so I’m writing it up to not forget.  The definitions of the logistic-normal and log-normal distributions are a little confusing with regard to their relationship to the normal distribution.  If you draw samples from one, the arrows below show the transformation to make it so that you have samples from another. For example, if x ~ Normal , then transforming as  y=exp(x) implies y ~ LogNormal .  The adjective terminology is inverted: the logistic function goes from normal to logistic-normal, but the log function goes from log-normal to normal (other way!).  The log of the log-normal is normal, but it’s the logit of the logistic normal that’s normal. Here are densities of these different distributions via transformations from a standard normal. In R:   x=rnorm(1e6); hist(x); hist(exp(x)/(1+exp(x))); hist(exp(x)) Just to make things more confusing, note the logistic-normal distributi
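A hedged base R sketch spelling out the arrows in that diagram (variable names are just illustrative; qlogis is base R’s logit, i.e. log(p/(1-p))): transform normal draws forward with exp or the logistic function, and transform back with log or logit to recover normality.

x  <- rnorm(1e6)                # x ~ Normal
ln <- exp(x)                    # ln ~ LogNormal; log(ln) is normal again
lg <- exp(x) / (1 + exp(x))     # lg ~ LogisticNormal; qlogis(lg), the logit, is normal again
par(mfrow = c(2, 2)); hist(x); hist(ln); hist(lg); hist(qlogis(lg))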

3 0.1195465 164 brendan oconnor ai-2011-01-11-Please report your SVM’s kernel!

Introduction: I’m tired of reading papers that use an SVM but don’t say which kernel they used.  (There’s tons of such papers in NLP and, I think, other areas that do applied machine learning.)  I suspect a lot of these papers are actually using a linear kernel. An un-kernelized, linear SVM is nearly the same as logistic regression — every feature independently increases or decreases the classifier’s output prediction.  But a quadratic kernelized SVM is much more like boosted depth-2 decision trees.  It can do automatic combinations of pairs of features — a potentially very different thing, since you can start throwing in features that don’t do anything on their own but might have useful interactions with others.  (And of course, more complicated kernels do progressively more complicated and non-linear things.) I have heard people say they download an SVM package, try a bunch of different kernels, and find the linear kernel is the best. In such cases they could have just used a logistic regr
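A minimal sketch of what making the kernel explicit could look like, assuming the e1071 package and hypothetical train / y objects (neither appears in the post):

library(e1071)
fit_linear <- svm(y ~ ., data = train, kernel = "linear")                  # behaves much like logistic regression
fit_quad   <- svm(y ~ ., data = train, kernel = "polynomial", degree = 2)  # allows pairwise feature interactions

Either way the kernel (and degree) is an explicit argument of the fitting call, so it costs nothing to report it in the paper.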

4 0.077489525 177 brendan oconnor ai-2011-11-11-Memorizing small tables

Introduction: Lately, I’ve been trying to memorize very small tables, especially for better intuitions and rule-of-thumb calculations. At the moment I have these above my desk: The first one is a few entries in a natural logarithm table. There are all these stories about how in the slide rule era, people would develop better intuitions about the scale of logarithms because they physically engaged with them all the time. I spend lots of time looking at log-likelihoods, log-odds-ratios, and logistic regression coefficients, so I think it would be nice to have quick intuitions about what they are. (Though the Gelman and Hill textbook has an interesting argument against odds scale interpretations of logistic regression coefficients.) The second one is a set of zsh filename manipulation shortcuts. OK, this is narrower than the others, but pretty useful for me at least. The third one is a set of rough unit equivalencies for data rates over time. I find this very important for quickly determ

5 0.057201084 182 brendan oconnor ai-2012-03-13-Cosine similarity, Pearson correlation, and OLS coefficients

Introduction: Cosine similarity, Pearson correlations, and OLS coefficients can all be viewed as variants on the inner product — tweaked in different ways for centering and magnitude (i.e. location and scale, or something like that). Details: You have two vectors \(x\) and \(y\) and want to measure similarity between them. A basic similarity function is the inner product \[ Inner(x,y) = \sum_i x_i y_i = \langle x, y \rangle \] If x tends to be high where y is also high, and low where y is low, the inner product will be high — the vectors are more similar. The inner product is unbounded. One way to make it bounded between -1 and 1 is to divide by the vectors’ L2 norms, giving the cosine similarity \[ CosSim(x,y) = \frac{\sum_i x_i y_i}{ \sqrt{ \sum_i x_i^2} \sqrt{ \sum_i y_i^2 } } = \frac{ \langle x,y \rangle }{ ||x||\ ||y|| } \] This is actually bounded between 0 and 1 if x and y are non-negative. Cosine similarity has an interpretation as the cosine of the angle between t
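A small base R sketch of the family described above, with x and y as arbitrary illustrative vectors; under the standard definitions, Pearson correlation is the cosine similarity of the centered vectors, and the simple OLS slope divides by the squared norm of centered x instead.

x <- c(1, 3, 2, 5); y <- c(2, 2, 4, 4)                          # arbitrary illustrative vectors
inner   <- sum(x * y)                                           # unbounded inner product
cossim  <- inner / (sqrt(sum(x^2)) * sqrt(sum(y^2)))            # divide out the L2 norms
xc <- x - mean(x); yc <- y - mean(y)                            # center first...
pearson <- sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))   # ...then cosine: equals cor(x, y)
ols_slope <- sum(xc * yc) / sum(xc^2)                           # slope of lm(y ~ x)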

6 0.04439925 178 brendan oconnor ai-2011-11-13-Bayes update view of pointwise mutual information

7 0.042741839 179 brendan oconnor ai-2012-02-02-Histograms — matplotlib vs. R

8 0.041701276 95 brendan oconnor ai-2008-03-18-color name study i did

9 0.039155547 183 brendan oconnor ai-2012-04-11-F-scores, Dice, and Jaccard set similarity

10 0.038318437 136 brendan oconnor ai-2009-04-01-Binary classification evaluation in R via ROCR

11 0.038042866 195 brendan oconnor ai-2013-04-21-What inputs do Monte Carlo algorithms need?

12 0.031412836 175 brendan oconnor ai-2011-09-25-Information theory stuff

13 0.029201737 204 brendan oconnor ai-2014-04-26-Replot: departure delays vs flight time speed-up

14 0.029059516 199 brendan oconnor ai-2013-08-31-Probabilistic interpretation of the B3 coreference resolution metric

15 0.0266429 129 brendan oconnor ai-2008-12-03-Statistics vs. Machine Learning, fight!

16 0.026463699 6 brendan oconnor ai-2005-06-25-idea: Morals are heuristics for socially optimal behavior

17 0.02574129 185 brendan oconnor ai-2012-07-17-p-values, CDF’s, NLP etc.

18 0.022914248 184 brendan oconnor ai-2012-07-04-The $60,000 cat: deep belief networks make less sense for language than vision

19 0.022185612 154 brendan oconnor ai-2009-09-10-Don’t MAWK AWK – the fastest and most elegant big data munging language!

20 0.021768663 111 brendan oconnor ai-2008-08-16-A better Obama vs McCain poll aggregation


similar blogs computed by lsi model

lsi for this blog:

topicId topicWeight

[(0, -0.063), (1, -0.055), (2, 0.061), (3, -0.084), (4, -0.023), (5, 0.086), (6, 0.007), (7, -0.121), (8, -0.076), (9, -0.0), (10, 0.034), (11, -0.066), (12, 0.0), (13, 0.014), (14, -0.053), (15, 0.047), (16, -0.089), (17, 0.029), (18, 0.043), (19, -0.131), (20, -0.051), (21, 0.073), (22, -0.018), (23, 0.053), (24, -0.053), (25, 0.028), (26, -0.057), (27, -0.049), (28, -0.014), (29, 0.012), (30, 0.034), (31, 0.06), (32, 0.236), (33, -0.065), (34, 0.17), (35, -0.103), (36, 0.08), (37, 0.122), (38, -0.019), (39, 0.001), (40, 0.048), (41, 0.046), (42, 0.047), (43, 0.008), (44, -0.025), (45, 0.038), (46, 0.018), (47, 0.203), (48, 0.025), (49, 0.072)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.99808156 201 brendan oconnor ai-2013-10-31-tanh is a rescaled logistic sigmoid function


2 0.65911597 164 brendan oconnor ai-2011-01-11-Please report your SVM’s kernel!

Introduction: I’m tired of reading papers that use an SVM but don’t say which kernel they used.  (There’s tons of such papers in NLP and, I think, other areas that do applied machine learning.)  I suspect a lot of these papers are actually using a linear kernel. An un-kernelized, linear SVM is nearly the same as logistic regression — every feature independently increases or decreases the classifier’s output prediction.  But a quadratic kernelized SVM is much more like boosted depth-2 decision trees.  It can do automatic combinations of pairs of features — a potentially very different thing, since you can start throwing in features that don’t do anything on their own but might have useful interactions with others.  (And of course, more complicated kernels do progressively more complicated and non-linear things.) I have heard people say they download an SVM package, try a bunch of different kernels, and find the linear kernel is the best. In such cases they could have just used a logistic regr

3 0.61206406 169 brendan oconnor ai-2011-05-20-Log-normal and logistic-normal terminology

Introduction: I was cleaning my office and found a back-of-envelope diagram Shay drew me once, so I’m writing it up to not forget.  The definitions of the logistic-normal and log-normal distributions are a little confusing with regard to their relationship to the normal distribution.  If you draw samples from one, the arrows below show the transformation to make it so that you have samples from another. For example, if x ~ Normal , then transforming as  y=exp(x) implies y ~ LogNormal .  The adjective terminology is inverted: the logistic function goes from normal to logistic-normal, but the log function goes from log-normal to normal (other way!).  The log of the log-normal is normal, but it’s the logit of the logistic normal that’s normal. Here are densities of these different distributions via transformations from a standard normal. In R:   x=rnorm(1e6); hist(x); hist(exp(x)/(1+exp(x))); hist(exp(x)) Just to make things more confusing, note the logistic-normal distributi

4 0.50304621 177 brendan oconnor ai-2011-11-11-Memorizing small tables

Introduction: Lately, I’ve been trying to memorize very small tables, especially for better intuitions and rule-of-thumb calculations. At the moment I have these above my desk: The first one is a few entries in a natural logarithm table. There are all these stories about how in the slide rule era, people would develop better intuitions about the scale of logarithms because they physically engaged with them all the time. I spend lots of time looking at log-likelihoods, log-odds-ratios, and logistic regression coefficients, so I think it would be nice to have quick intuitions about what they are. (Though the Gelman and Hill textbook has an interesting argument against odds scale interpretations of logistic regression coefficients.) The second one is a set of zsh filename manipulation shortcuts. OK, this is narrower than the others, but pretty useful for me at least. The third one is a set of rough unit equivalencies for data rates over time. I find this very important for quickly determ

5 0.41180772 195 brendan oconnor ai-2013-04-21-What inputs do Monte Carlo algorithms need?

Introduction: Monte Carlo sampling algorithms (MCMC or otherwise) aim to produce samples from a distribution.  They can be organized by what inputs or prior knowledge about the distribution they require.  This ranges from a low amount of knowledge, as in slice sampling (just give it an unnormalized density function), to a high amount, as in Gibbs sampling (you have to decompose your distribution into individual conditionals). Typical inputs include \(f(x)\), an unnormalized density or probability function for the target distribution, which returns a real number for a variable value.  \(g()\) and \(g(x)\) represent sample generation procedures (that output a variable value); some generators require an input, some do not. Here are the required inputs for a few algorithms.  (For an overview, see e.g. Ch 29 of MacKay.)  There are many more out there of course.  I’m leaving off tuning parameters. Black-box samplers:  Slice sampling ,  Affine-invariant ensemble - unnorm density \(f(x)\
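As a low-end illustration of that input contract, here is a random-walk Metropolis sampler in R (used instead of slice sampling only because it is shorter); like the black-box samplers named above, it takes nothing but an unnormalized density \(f(x)\), plus tuning choices for the start point, chain length, and step size.

f <- function(x) exp(-x^2 / 2)                     # unnormalized standard normal density
metropolis <- function(f, x0, n, step = 1) {
  x <- numeric(n); x[1] <- x0
  for (i in 2:n) {
    prop <- x[i - 1] + rnorm(1, sd = step)         # symmetric random-walk proposal
    x[i] <- if (runif(1) < f(prop) / f(x[i - 1])) prop else x[i - 1]
  }
  x
}
hist(metropolis(f, x0 = 0, n = 1e4))               # roughly standard-normal samples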

6 0.39868748 183 brendan oconnor ai-2012-04-11-F-scores, Dice, and Jaccard set similarity

7 0.34035009 182 brendan oconnor ai-2012-03-13-Cosine similarity, Pearson correlation, and OLS coefficients

8 0.30461144 166 brendan oconnor ai-2011-03-02-Poor man’s linear algebra textbook

9 0.2968896 178 brendan oconnor ai-2011-11-13-Bayes update view of pointwise mutual information

10 0.29102901 100 brendan oconnor ai-2008-04-06-a regression slope is a weighted average of pairs’ slopes!

11 0.28138334 175 brendan oconnor ai-2011-09-25-Information theory stuff

12 0.27933562 95 brendan oconnor ai-2008-03-18-color name study i did

13 0.24898045 179 brendan oconnor ai-2012-02-02-Histograms — matplotlib vs. R

14 0.24001829 185 brendan oconnor ai-2012-07-17-p-values, CDF’s, NLP etc.

15 0.23622124 136 brendan oconnor ai-2009-04-01-Binary classification evaluation in R via ROCR

16 0.23336671 40 brendan oconnor ai-2006-06-28-Social network-ized economic markets

17 0.20429976 6 brendan oconnor ai-2005-06-25-idea: Morals are heuristics for socially optimal behavior

18 0.19898421 149 brendan oconnor ai-2009-08-04-Blogger to WordPress migration helper

19 0.18362904 186 brendan oconnor ai-2012-08-21-Berkeley SDA and the General Social Survey

20 0.17818755 68 brendan oconnor ai-2007-07-08-Game outcome graphs — prisoner’s dilemma with FUN ARROWS!!!


similar blogs computed by lda model

lda for this blog:

topicId topicWeight

[(26, 0.753), (44, 0.025), (74, 0.044)]

similar blogs list:

simIndex simValue blogId blogTitle

same-blog 1 0.98172385 201 brendan oconnor ai-2013-10-31-tanh is a rescaled logistic sigmoid function


2 0.064880855 123 brendan oconnor ai-2008-11-12-Disease tracking with web queries and social messaging (Google, Twitter, Facebook…)

Introduction: This is a good idea: in a search engine’s query logs, look for outbreaks of queries like [[flu symptoms]] in a given region.  I’ve heard (from Roddy) that this trick also works well on Facebook statuses (e.g. “Feeling crappy this morning, think I just got the flu”). Google Uses Web Searches to Track Flu’s Spread – NYTimes.com Google Flu Trends – google.org For an example with a publicly available data feed, these queries work decently well on Twitter search: [[ flu -shot -google ]] (high recall) [[ "muscle aches" flu -shot ]] (high precision) The “muscle aches” query is too sparse and the general query is too noisy, but you could imagine some more tricks to clean it up, then train a classifier, etc.  With a bit more work it looks like geolocation information can be had out of the Twitter search API.

3 0.063158371 203 brendan oconnor ai-2014-02-19-What the ACL-2014 review scores mean

Introduction: I’ve had several people ask me what the numbers in ACL reviews mean — and I can’t find anywhere online where they’re described. (Can anyone point this out if it is somewhere?) So here’s the review form, below. They all go from 1 to 5, with 5 the best. I think the review emails to authors only include a subset of the below — for example, “Overall Recommendation” is not included? The CFP said that they have different types of review forms for different types of papers. I think this one is for a standard full paper. I guess what people really want to know is what scores tend to correspond to acceptances. I really have no idea and I get the impression this can change year to year. I have no involvement with the ACL conference besides being one of many, many reviewers. APPROPRIATENESS (1-5) Does the paper fit in ACL 2014? (Please answer this question in light of the desire to broaden the scope of the research areas represented at ACL.) 5: Certainly. 4: Probabl

4 0.062325791 63 brendan oconnor ai-2007-06-10-Freak-Freakonomics (Ariel Rubinstein is the shit!)

Introduction: I don’t care how lame anyone thinks this is, but economic theorist Ariel Rubinstein is the shit. He’s funny, self-deprecating, and brilliant. I was just re-reading his delightful, sarcastic review of Freakonomics . (Overly dramatized visual depiction below; hey, conflict sells.) The review consists of excerpts from his own upcoming super-worldwide-bestseller, “Freak-Freakonomics”. It is full of golden quotes such as: Chapter 2: Why do economists earn more than mathematicians? … The comparison between architects and prostitutes can be applied to mathematicians and economists: The former are more skilled, highly educated and intelligent. To elaborate: Levitt has never encountered a girl who dreams of being a prostitute and I have never met a child who dreams of being an economist. Like prostitutes, the skill required of economists is “not necessarily ‘specialized’” (106). And, finally, here is a new explanation for the salary gap between mathematicians and eco

5 0.062225286 26 brendan oconnor ai-2005-09-02-cognitive modelling is rational choice++

Introduction: Rational choice has been a huge imperialistic success, growing in popularity and being applied to more and more fields. Why is this? It’s not because the rational choice model of decision-making is particularly realistic. Rather, it’s because rational choice is a completely specified theory of human behavior , and therefore is great at generating hypotheses. Given any situation involving people, rational choice can be used to generate a hypothesis about what to expect. That is, you just ask, “What would a person do to maximize their own benefit?” Similar things have been said about evolutionary psychology: you can always predict behavior by asking “what would hunter-gatherers do?” Now, certainly both rational choice and evolutionary psychology don’t always generate correct hypotheses, but they’re incredibly useful because they at least give you a starting point. Witness the theory of bounded rationality: just like rational choice, except amended to consider computational l

6 0.061973974 105 brendan oconnor ai-2008-06-05-Clinton-Obama support visualization

7 0.061454009 77 brendan oconnor ai-2007-09-15-Dollar auction

8 0.060616948 19 brendan oconnor ai-2005-07-09-the psychology of design as explanation

9 0.058354188 152 brendan oconnor ai-2009-09-08-Another R flashmob today

10 0.058315024 138 brendan oconnor ai-2009-04-17-1 billion web page dataset from CMU

11 0.055383153 129 brendan oconnor ai-2008-12-03-Statistics vs. Machine Learning, fight!

12 0.054933697 86 brendan oconnor ai-2007-12-20-Data-driven charity

13 0.054315742 53 brendan oconnor ai-2007-03-15-Feminists, anarchists, computational complexity, bounded rationality, nethack, and other things to do

14 0.053873405 188 brendan oconnor ai-2012-10-02-Powerset’s natural language search system

15 0.053635284 179 brendan oconnor ai-2012-02-02-Histograms — matplotlib vs. R

16 0.053431518 2 brendan oconnor ai-2004-11-24-addiction & 2 problems of economics

17 0.052980348 150 brendan oconnor ai-2009-08-08-Haghighi and Klein (2009): Simple Coreference Resolution with Rich Syntactic and Semantic Features

18 0.052366126 6 brendan oconnor ai-2005-06-25-idea: Morals are heuristics for socially optimal behavior

19 0.052110419 184 brendan oconnor ai-2012-07-04-The $60,000 cat: deep belief networks make less sense for language than vision

20 0.051230647 198 brendan oconnor ai-2013-08-20-Some analysis of tweet shares and “predicting” election outcomes