Today’s guest post is written by Dustin Fife, who holds a doctorate in statistics. He has also written a few science fiction novels. What he has done with this post is simply amazing: he used his statistical powers for good and analyzed in depth the money-earning potential of self-publishing versus traditional publishing. If you are still on the fence about which route to go with your current work in progress, then read Dustin’s expert advice below:
What’s more lucrative? Traditional or self-publishing?
The results may surprise you…. 🙂
Below, I’m going to flex my statistical muscles. Using the Amazon bestseller list, I’ll show you which publishing route tends to bring home the big bucks.
- The data
- Hugh’s methods versus mine
- How many authors make money
- Indie versus Amazon versus Big Five
- What really matters in publishing
See, here’s the thing. Of late, aspiring writers have been faced with a dilemma–do I self-publish or do I pursue the traditional publishing route? It used to be that there was only one choice. But lately, self-publishing has gained some traction as a viable alternative to the traditional route. And so, what many wanna know is this–will I make more money self-publishing than going through traditional publishing channels?
The short (and unfortunate) answer is–um….er….it’s kinda difficult to say definitively. To know, we’d need data, and all the data we have is less than ideal (see here and here). Some data collected from the traditional publishers suggests ebooks are on the decline and there’s no competition–traditional publishing rules. But others, notably Indie author Hugh Howey, have data that suggest otherwise.
Or maybe not.
Or maybe, we don’t know.
(How’s that for a definitive answer!)
Book sales and author earnings are a somewhat closely guarded secret. But, with some fancy pantsy computer scraping and by gathering data from other sources, Hugh and his elusive data wonk have compiled the Holy Grail of information–they have pulled and compiled information from Amazon’s bestseller list.
Stats nerds haven’t seen such an outpouring of data awesomeness since the Netflix competition. When I heard of the Amazon data, I salivated from my statistical glands. And I had to get my hands on it.
And now I do.
So what is the Amazon dataset? Well…see for yourself. It’s the “hourly updated” list of what is currently “hot” in the bookselling world. Each book contains information about publisher, number of ratings, average rating, and most importantly….
Here’s where the guesswork comes in. We don’t know exactly how to translate an Amazon rank into a metric you and I are familiar with, like money and book sales. But we have some guesses. Fortunately for us broke authors, some hotshot authors noticed that, ‘hey, well, what do you know? I’m #1 on Amazon!’ They then reported the number of books they’ve sold. So, if the #1 author reported xxxx number of sales, we can extrapolate and assume that the current person at slot #1 is also selling about as many books (see this). We also don’t know the “cut” each author gets, but we can guess by, again, using other sources from the dubya, dubya, dubya.((By far, the largest limitation of this dataset is that it’s only a snapshot. A book that sold millions six years ago might be selling nothing now. In this dataset, the “salary” of the author would be zero. I’ve attempted to make the data more relevant by only looking at novels published within the last three years. But still…it doesn’t entirely overcome the issue. More on that later.))
So, with that, I’m going to try to answer that question–am I better off self-publishing or should I pursue the traditional route?
Why yes…yes it has! Hugh and his data jedi have produced quite detailed reports.
So, Mr. smarty-pants, why do we need another???
Good question. The fortunate thing about data like this is that there’s a ton of questions that could be asked. Hugh asked his and now I have questions of my own.
Not only do our questions differ, but our means of approaching the same question may differ.
Example–Hugh’s report gives the total author sales for authors who Indie pubbed versus those who pubbed through the “Big Five.” Two problems. First, sums (and averages) are highly sensitive to outliers. Suppose I’m interested in the average salary of Americans and happen to sample Bill Gates. Would you be surprised to learn that the average salary of Americans, based on my poll, is 16 million dollars? Talk about the land of the free and the home of the bling-bling!
Nope. Wrong. In this case, it’s better to compute the median (or just transform the data).
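Here’s a minimal sketch of that outlier problem. The salaries are made up purely for illustration (nine ordinary earners plus one Bill-Gates-sized paycheck); nothing here comes from the Amazon data.

```python
from statistics import mean, median

# Hypothetical salaries in dollars (invented for illustration only).
ordinary = [32_000, 41_000, 45_000, 48_000, 52_000, 55_000, 61_000, 67_000, 75_000]

# The same poll after one enormous outlier wanders into the sample.
with_outlier = ordinary + [160_000_000]

mean_before, median_before = mean(ordinary), median(ordinary)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# The mean explodes; the median barely notices.
print(f"mean:   {mean_before:>14,.0f} -> {mean_after:>14,.0f}")
print(f"median: {median_before:>14,.0f} -> {median_after:>14,.0f}")
```

One extreme value drags the mean from about 53K to about 16 million, while the median only moves from 52K to 53.5K.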
Second problem–Simpson’s paradox and other such deceptive features of aggregate-level data. Huh? Simpson’s what? Let me explain. Suppose I’m interested in studying gender discrimination at a university. I sample men. I sample women. Compute the mean salary for each. Bada-bing, bada-boom–HUGE discrepancies. Now, suppose another researcher comes along and attempts to answer the same question by computing means for men and women within each department. So, male professors from the business department are compared to female professors from the same department; male professors from the education department are compared to female professors from the education department.
The mean difference suddenly vanishes.
Here, the first dude was conflating male-dominated departments with gender itself. But once you “control” for department, men and women are equal.
(By the way–this was just an example. I actually have no idea if there are or are not gender differences).
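To make the department example concrete, here’s a small sketch with invented numbers. It assumes every professor in a department earns the same salary regardless of gender, and that the high-paying department happens to be mostly men. The departments, headcounts, and salaries are all hypothetical.

```python
from statistics import mean

# Hypothetical professors: (department, gender, salary in $1000s).
# Within each department, men and women are paid exactly the same.
professors = (
    [("business", "M", 120)] * 8 + [("business", "F", 120)] * 2
    + [("education", "M", 70)] * 2 + [("education", "F", 70)] * 8
)

def avg_salary(rows, gender, department=None):
    """Mean salary for one gender, optionally restricted to one department."""
    return mean(
        salary
        for dept, g, salary in rows
        if g == gender and (department is None or dept == department)
    )

# Aggregate comparison: ignores department entirely.
overall_gap = avg_salary(professors, "M") - avg_salary(professors, "F")

# "Controlled" comparison: men versus women inside each department.
within_gaps = {
    dept: avg_salary(professors, "M", dept) - avg_salary(professors, "F", dept)
    for dept in ("business", "education")
}

print("overall gap:", overall_gap)
print("within-department gaps:", within_gaps)
```

The aggregate gap is 30K, yet the gap inside every department is zero. The whole difference comes from who works where, not from gender.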
So how does this relate to the Amazon dataset? Well, it may be the case that there are whopping differences between Indie and Big Five authors, but that difference may be due to something other than publishing avenue. (More on this later).
So, with that exceptionally long introduction, let’s get to work.
Just as a fun little exercise, I’m going to randomly sample an author’s salary and assume said author is your clone–just to give you an idea of what you’ll likely make. Ready……
Sorry. You made nothing.
Ok. Bad draw. Let’s try again.
This could go on for a while. But, out of curiosity, I decided to compute how many authors made nothing:
In other words, only 20% of authors actually made something. Now, remember that Amazon’s rank score is a measure of how the book is doing right at this moment and maybe a little bit into the future((I’m guessing Amazon has some sort of a time-series model, like an ARIMA. Such models tend to highly weight yesterday’s data, weight a little less the data from the day before, and so on. This sort of modeling allows you to predict what things will look like in the future, based on past information. Also, as I mentioned earlier, this is a serious limitation of the data.)). So, for 80% of authors, Amazon predicts that they’ll make nothing now and won’t make squat in the near future.
Well that’s depressing!
But how does this compare across the several avenues of publishing?
Proportion of Authors Who Currently Make Money
Big Five((Big Five authors (and small publishing authors) tend to get advances and that’s obviously not factored into this statistic. But again, these values represent what authors are making currently, or in the near future)): 37%
Small Publisher: 15%
Iiiiiinteresting. Amazon smokes everyone else in this category. 70% of their authors sell something (at least, sell something right now or in the near future).
As for Indie publishing….well, it certainly isn’t besting the Big Five!
Or is it….[evil laugh].
For the rest of the analyses, I’m going to ignore the people that make nothing. I’m going to assume that you all don’t want to hear about them. The statistician in me cringes at the idea of tossing data((It’s a “missing data” problem when you omit observations. In such a case, the results are “biased” if we wish to generalize our findings beyond the “selected” sample. However, I’m not really interested in generalizing my results to those who make nothing.)), but it also makes it a whole lot easier to do the rest of the analysis((I actually transform the dependent variable with a log transformation. If the zeros are included (after rescaling, of course), I get a bimodal distribution and that really mucks things up. So, I just stuck with those authors who had more than zero.)).
The non-zero heroes
Let’s go back to that question–will I make more if I Indie publish or if I land a contract with the Big Five?
That problem is inherently difficult to answer. In all likelihood, you and I will make nothing. But suppose you do? Which course will bring more income? Again, it’s hard to answer because it depends so heavily on the quality of your book.
Now suppose we could pit the most popular Indie book against the most popular Big Five book? Or suppose we could pit the second most popular Indie versus the second most popular Big Five? (And so on down the line).
We’d get something like this…
Let me explain. Each dot represents a rank. So that little dot in the top right represents the #1 ranked person. If we follow the dot down to the x-axis, that’s how much the top-ranking Indie author makes. If we follow the dot left to the y-axis, that’s how much the top-ranking Big Five author makes. If the dot falls on the line, the two make the same.
You with me?
The second dot is kinda sorta like the second place Indie versus Big Five, and so on.
Dots above the diagonal mean that the Big Five peeps are making more than the Indie peeps (at that particular rank). Dots below the line mean the Indie peeps are making more than the Big Five peeps (again, at that particular rank).
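For the curious, here’s a rough sketch of how dots like these can be computed when the two groups have different numbers of authors: match them up by percentile rather than by raw rank. The earnings are invented, and I’m using deciles instead of all 99 percentiles just to keep the output short.

```python
from statistics import quantiles

# Hypothetical annual earnings in dollars (invented; groups differ in size).
indie = [500, 1_200, 2_500, 4_000, 9_000, 15_000, 40_000, 250_000]
big_five = [200, 600, 1_500, 3_000, 6_000, 20_000, 90_000,
            900_000, 2_000_000, 6_000_000]

# Nine cut points per group: the 10th, 20th, ..., 90th percentiles.
indie_q = quantiles(indie, n=10, method="inclusive")
big_q = quantiles(big_five, n=10, method="inclusive")

# Each pair is one dot: x = Indie earnings, y = Big Five earnings
# at the same percentile.
dots = list(zip(indie_q, big_q))

for pct, (x, y) in zip(range(10, 100, 10), dots):
    if y > x:
        side = "above the diagonal (Big Five ahead)"
    elif y < x:
        side = "below the diagonal (Indie ahead)"
    else:
        side = "on the diagonal (tie)"
    print(f"{pct}th percentile: indie={x:>12,.0f}  big_five={y:>12,.0f}  {side}")
```

Plot each pair with a diagonal reference line and you get a chart of the kind described above.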
What does this tell us? It says that the top 10% or so of Big Five authors are making way more than the top 10% of Indie authors.
What about the rest of the lot? Well, it’s kinda hard to see, so let’s zoom in.
Interesting. So, the Indie authors tend to dominate the middle and lower tax brackets. Put differently, if your book is going to make millions, you’re probably better off as a Big Five author (cuz you’d probably make 6 mil instead of 2 mil). If your book ain’t gonna make that much, you’re better off Indie-ing.
What about Amazon publishers?
Now that’s fascinating! Except for the #1 ranking, everywhere else, you’re better off going with Amazon publishing. For example, the third((I’m oversimplifying things here. Each dot represents a percentile. The third dot is the third percentile of the data, not the third place person. If there were exactly 100 authors in each group, the two would be identical)) place Amazon person makes about six hunned thou, while the third place Big Five person makes about two hunned thou.
I know, I know. You’re probably saying, “Hey, dude! What about that Simpson’s paradox thingey?”
Right. That. So, maybe Amazon-published peeps are making more because the Big Five are charging way more for books? Or perhaps we’re comparing apples and oranges here. Maybe the Big Five books are so much higher in quality that it doesn’t make sense to compare the best Big Five to the best Indie. Maybe the two are incomparable, like pitting the best NBA player against the best 3rd grader in Pawnee, Oklahoma! In other words, maybe the differences in salary are because of differences in quality. If all Indie authors (as some dogmatic jerks suggest) are merely the dregs of the Big Five–those who couldn’t even find an agent–maybe that discrepancy is because of talent, not publishing venue.
You’ve got a point there.
But how do we “control” for that?
Controlling for “cost” is easy–we just include the price of the book in the model.
But book quality?
That’s a toughey. I can “control” for the average rating. If we assume that a 4.2 book for Indie is of the same quality as a 4.2 book for the Big Five, we can use some statistical ninja-ry to account for that. But here are two problems with that: (1) Indie authors may be more likely to push friends and family to inflate their ratings, or (worse) pay others to inflate them (a point made by Hugh). Well darn. Or, (2) Big Five books may be held to a higher standard than Indie books. In either case, a 4.2 for an Indie book may mean something totally different than a 4.2 for a Big Five (another point made by Hugh).
But what are you gonna do?
Well, we can kinda sorta control for the authors.((I actually performed an analysis of those authors who published both with Amazon and with the Big Five, or with Indie and the Big Five. Unfortunately, the sample size was teeny and nothing reliable could be gleaned from it.)) Surely the difference in quality for the same author is minimal. (Although we all have read multiple books by the same author of vastly different quality…but I’m hoping that’s the exception, rather than the rule.)
So I’m going to control for it.((I actually reran the analysis without controlling for it and it didn’t make a particle of difference. But…I’m unsure whether including a “random author” effect (in statistical nomenclature) actually “controls” for author talent. I know if I included it as a fixed effect, it would, but I’m not entirely sure if the same holds for random effects, particularly since some authors only have one book in the dataset. I posted a question on stackexchange and haven’t heard a reply. I’ll keep y’all posted if that changes.))
So with that, here’s the differences between the different publishing paths (controlling for book rating, age of the book, and book price)((The model fitted was a random-intercepts model where authors were modeled as a random effect. I included as predictors (fixed predictors) book price, age of the book (in months), Publisher (obviously), average rating, and the number of books written. I also included a polynomial term for book price since there seemed to be a slight curvilinear relationship between book price and number of sales.))
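For the stats-minded, the model described in that footnote could be written roughly as follows. Treat this as my reading of the description rather than the exact specification: I’m assuming the outcome is the log-transformed earnings mentioned earlier, that the “polynomial term” for price is a simple quadratic, and that publisher enters as a set of dummy variables. All the symbol names are mine.

```latex
\[
\log(y_{ij}) = \beta_0
  + \beta_1\,\mathrm{price}_{ij}
  + \beta_2\,\mathrm{price}_{ij}^{2}
  + \beta_3\,\mathrm{age}_{ij}
  + \beta_4\,\mathrm{rating}_{ij}
  + \beta_5\,\mathrm{books}_{j}
  + \boldsymbol{\gamma}^{\top}\mathrm{publisher}_{ij}
  + u_j + \varepsilon_{ij},
\qquad
u_j \sim N(0, \sigma_u^{2}),\quad
\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^{2})
\]
```

Here $y_{ij}$ is the earnings of book $i$ by author $j$, $u_j$ is the random intercept for author $j$, and $\varepsilon_{ij}$ is the leftover book-level noise.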
Amazon still comes out on top and by a whopping margin. In addition, even if we don’t assume that average rating controls for book quality, it doesn’t matter. Notice that the worst Amazon book still outperforms the best Big Five book (on “average”–i.e., a log-transformed average). In other words, the highest rated Big Five books do worse than the lowest rated Amazon books, on average.
And that, my friends, is why reporting means alone can be disastrously deceiving!
Y’all saw it. Amazon tends to come out on top, but even the best Amazon-pubbed authors with the highest Amazon ratings are only predicted to make about 10K a year. Yikes! Hard to feed a family on 10K.
So how do you explain the people that “break out?” It’s obviously not publishing venue!
It turns out, there’s a way to answer that statistically with a cool little statistic called the intra-class-correlation (or ICC). The ICC tells us, in stat-ese, “the proportion of variability explained by xxx.”
In less erudite terms (and using our example)–the ICC tells us how much of a role author talent/reputation plays in determining your salary (after controlling for publisher, age of book, etc.).((The ICC is computed by dividing the author variance (.7411) by the sum of the author variance and the residual variance (.6961).)) Pretty neat, eh?
And the answer is…
Wow! Your talent and reputation account for a whopping 51% of the variability in your salary. So what’s the difference between the Amazon authors who make 10K versus 1mil? Well, 51% of that is because of the author.
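If you want to check my arithmetic, the ICC is just a ratio of the two variance estimates from the footnote. The variable names below are mine; the two numbers are the ones reported in the footnote.

```python
# Variance components reported in the footnote.
author_variance = 0.7411    # between-author (random intercept) variance
residual_variance = 0.6961  # leftover book-level variance

# ICC = share of the total (author + residual) variance that sits at the author level.
icc = author_variance / (author_variance + residual_variance)

print(f"ICC = {icc:.3f}")
```

That works out to roughly 0.516, i.e. the 51% quoted above (rounded down).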
And what about the rest?
So, I guess the best advice I can offer at this point is this: work on your craft. Write a dang-good book. Pray to the gods of luck. And after that? Your best bet is probably publishing through Amazon.
- You’re most likely going to make nothing in the near future. (How insanely depressing).
- Amazon tends to have the highest proportion of authors who make money.
- Amazon tends to best the others (Big Five and Indie) in all “tax brackets.”
- After controlling for “book quality,” Amazon still does way better than the other two.
- Author reputation/talent and luck play a far greater role in determining your salary than does publishing venue.
And that’s “science” for ya.
But stay tuned. There’s a wealth of data and a wealth of potential questions to ask.