Fan Fiction Statistics - FFN Research

Friday, 18 March 2011

Fan Fiction Demographics in 2010: Age, Sex, Country

Preamble

There is an evident vacuum when it comes to information about online fiction writers and readers. Forget trying to find data on who these people are via Google or Bing. You can’t set your expectations high even with access to subscription-based academic databases such as EBSCO or Emerald Insight: the content you need is missing. Several researchers I have come in contact with gave up trying to find reliable data on the internet, citing a poor information environment. A reasonable choice. It is far easier to work with enthusiasts of other activities, such as sports, where information is available in ample amounts. Fan fiction is a different story.

What have I found in secondary sources? Limited traffic statistics with no raw numbers (credit to Alexa), literary fannings about modern culture and summary demographics about Major League Baseball fans. For contrast, there was a three-page essay from The Gay & Lesbian Review by Marianne MacDonald about Harry Potter fan fiction, a broad complaint about limited success with samples no greater than 10 people. Oddly enough, the article’s author attempted to draw conclusions from such data. The closest to fan fiction I’ve gotten from aca-fen (academics as fans) was a series of essays entitled Fan Fiction and Fan Communities in the Age of the Internet. No empirics there.

In retrospect, the only available information was qualitative, limited to a small community or plain useless for a more general study. Ergo, FFN Research needed to befriend the DIY framework. The result of this choice has been a 20 MB Excel spreadsheet with numeric and non-numeric data.

To save time and space, the study will be presented in parts. This also gives you, dear readers, an opportunity to comment and make suggestions for the next part. Your opportunities are practically limitless at this point. If at least half the variables undergo regression analysis, there will be no fewer than three parts in total.

Take your time.

End Preamble

INTRODUCTION

The goal of this release is to provide empirical data and analysis on fan demographics and interests on the fan fiction writing site FanFiction.Net. The research deals with basic demographic data such as age, sex and country of residence of registered FanFiction.Net members in relation to their public writing activities in 2010. Based on a quantitative approach, the research should provide guidelines for future studies.

METHODOLOGY

FanFiction.Net user profiles are the main source of empirical data in this research. With nearly three million registered users, FanFiction.Net is the largest hub for fan fiction writing communities, the largest archive of fan fiction with in excess of 6,600,000 registered titles as of March 2011, and a trend-setter for fan fiction as a phenomenon. In addition, it is a site that challenges Facebook in the amount of time spent browsing within the domain.

95,313 public profiles of registered members created in the year 2010 were analysed. A single-year study was the most feasible choice at present, because chronological data becomes progressively scarcer over longer periods. Fieldwork took place from January 27, 2011 to February 11, 2011, and the empirical data reflects the state of public member profiles between those dates.

The said profiles were chosen as clusters from the total of 443,400 accounts created in the year 2010. Accounts created between the first and the seventh day of every month are present in the analysis. This was a preventative measure taken to mitigate seasonal fluctuations (see picture 1) and to ensure every seven-day period captures the full heterogeneity within a cluster. Please note that each chosen period covers one whole weekly fluctuation cycle.

Picture 1. Aggregated pageviews on FanFiction.Net, September 2010-March 2011
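
The sampling rule described above can be sketched in a few lines. This is only an illustration of the first-to-seventh-day window; the function name and the idea of having each account’s creation date at hand are assumptions, not a description of the actual collection script.

```python
from datetime import date


def in_sample_window(created: date, year: int = 2010) -> bool:
    """True if an account was created on days 1-7 of any month of `year`."""
    return created.year == year and 1 <= created.day <= 7


# Hypothetical creation dates, for illustration only.
examples = [date(2010, 3, 1), date(2010, 3, 7), date(2010, 3, 8), date(2011, 1, 2)]
kept = [d for d in examples if in_sample_window(d)]
print(kept)  # only the first two dates fall inside a sampled window
```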

The sample of 95,313 is further explained by uncertainty and a lack of prior empirical studies. Smaller samples of 1100, sufficient for error margins as low as 3%, would have provided inaccurate data once sliced into categories. As it was impossible to find comparisons, the safer approach was retained. The current conditions, verified by Raosoft, allow for a 0.37% margin of error at a 99% confidence level.
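
For readers who want to check the margin themselves, here is a minimal sketch. It assumes the standard margin-of-error formula for a proportion with a finite population correction and a worst-case 50/50 split; the exact formula Raosoft applies is not stated here, so treat the function as an approximation that happens to reproduce the quoted figure.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, population=None, confidence=0.99, p=0.5):
    """Margin of error for a proportion p estimated from n observations."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    moe = z * sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction: the sample is a large part of the whole.
        moe *= sqrt((population - n) / (population - 1))
    return moe


print(f"{margin_of_error(95_313, 443_400):.2%}")  # 0.37%
```

Under the same assumptions, the function applied to the 9544 profiles with a disclosed sex gives about 1.3% for a 50/50 split and about 1.1% for p=0.78, in line with the margins quoted in the section on sex.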

Acknowledging possible difficulties in attaining the necessary information via surveying or other applicable means, data mining was chosen as the only feasible collection method. Quantitative data later underwent a 5% reduction on extremes to mitigate outliers. This also dodged factual inaccuracies such as account holder ages reportedly being 1000 or two. A detailed explanation of variables is included in the definitions part.
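
The reduction on extremes can be pictured as symmetric trimming. The text above does not say whether the 5% was cut from each tail or shared between both, so the sketch below simply assumes the same percentage is dropped from each end; the function name and the sample ages are made up for illustration.

```python
def trim_extremes(values, percent=5):
    """Drop the lowest and highest `percent` percent of values (assumed per tail)."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100  # number of values cut from each end
    return ordered[k:len(ordered) - k] if k else ordered


# Hypothetical reported ages, including two implausible entries.
ages = [2] + list(range(13, 31)) + [1000]
trimmed = trim_extremes(ages)
print(min(trimmed), max(trimmed))  # 13 30
```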

Descriptive statistics are a cornerstone of this research.

DEFINITIONS

The following variables are used in the analysis. Each is defined below. In addition, we explain other terminology unique to this research.

FanFiction.Net – the largest fan fiction writing website in the world. Also referred to as FF.Net and FFN, the site, the domain.

Fan – any FanFiction.Net account holder.

Fandom – any series, TV show or title present as a category for fan fiction uploads on FFN. Also, a group of FanFiction.Net account holders, who uploaded fan fiction to FFN.

ID – unique public member profile and account identification number, assigned to every FanFiction.Net user upon signing up. The number of accounts a user may create is not limited by the domain’s ToS.

Pen name – a pseudonym taken by the account holder. Every account holder is required to have a unique pseudonym upon signing up.

Country – the country of residence/access during a browsing session on FanFiction.Net reported by the account holder’s ISP or proxy at the time of data collection. Users have the ability to disable public display of their country.

Age – self-identified account holder’s age displayed in English on the public profile. Users are not provided any extra facilities to display their age, nor does FanFiction.Net collect specific age data upon signing up. Internet users wishing to hold an account on the site only stipulate they are aged 13 or older.

Sex – self-identified account holder’s sex (gender) displayed in English on the public profile. Users are not provided any extra facilities to display their sex, nor does FanFiction.Net collect any data about sex.

Avatar – graphic uploading service, which allows registered users to be associated with one picture in three square formats: 150x, 75x and 50x.

Profile length – cumulative size of a public profile starting from below the space reserved for the account’s pen name to the end of the profile table’s space. Does not include lists such as “Favorite Stories”.

Beta Reader – an account with a qualified Beta Reader portfolio. Beta Readers provide editorial services to fan fiction content before it is made public.

Story count – the number of separate public fan fiction uploads made by an account holder.

Fandom count – the number of separate fan fiction categories with at least one fan fiction title made public by the account holder.

ANALYSIS (PART ONE)

COUNTRY OF RESIDENCE

In 2010, accounts on FanFiction.Net were created and accessed by people in at least 173 countries, from Afghanistan to Zambia. A full list of countries is available here. The figure may, in reality, be larger as the domain uses a self-served set of definitions for a country and recognizes non-standard internet service properties as separate countries. These include:

-Satellite Connection Providers “Satellite Provider” – direct device to commercial satellite connections impossible to trace to any specific country. Common in the Middle East. Falsely recognized as a country with a flag of its own.

-Encrypted Anonymous Proxies “Anonymous Proxy”. Falsely recognized as a country with a flag (“Jolly Roger”) of its own.

-The country of Europe “Europe” – government institutions and their encrypted networks, not reporting to belong to a specific EU country. Falsely recognized as a country with a flag (EU flag) of its own.

-“Asia/Pacific Region” – umbrella term for any of the small islands not recognized by FFN’s technology and other territories applicable. Falsely recognized as a country with a flag of its own.

Satellite Providers, Anonymous Proxies and Europe (country) do not participate in further analysis, effectively reducing the sample to 95,219 accounts. Asia/Pacific Region remains, as it behaves like a legitimate umbrella zone. In years prior to 2010, there were reports of registered fan fiction writers or readers hailing from Antarctica.

However, the biggest issue in making sense of the quantitative data is the fact that 25,297 accounts in the sample did not have a publicly specified country. FFN’s system was not allowed to disclose such information. This translates into roughly 110,000 accounts not having permitted access to the data, or 24.8% of accounts made in 2010. It may explain why certain countries, such as North Korea, are not present in the study. Regardless, 75.2% of all account holders joining in 2010 allowed the site to display their country of access.

To make results readable, FFN Research decided to put forward a 0.5% threshold. For a country to be included in the analysis, at least half of a percent of all accounts with a specified country had to originate from it. This translated into 0.5% of 69,500.
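
As a rough sketch of the entry limit: the function below keeps only countries at or above 0.5% of the accounts with a specified country. The USA, UK and Canada counts are the ones quoted in this post; the "Elsewhere" entry and the function name are invented purely to show a country falling under the cutoff.

```python
def apply_threshold(counts, total_with_country, min_share=0.005):
    """Keep countries whose account count is at least `min_share` of the total."""
    cutoff = total_with_country * min_share
    kept = {country: n for country, n in counts.items() if n >= cutoff}
    return kept, cutoff


counts = {"USA": 35_361, "UK": 5_739, "Canada": 3_513, "Elsewhere": 120}
kept, cutoff = apply_threshold(counts, 69_500)
print(cutoff)        # about 347.5 accounts
print(sorted(kept))  # "Elsewhere" is dropped
```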

As a result of the entry limit, the number of accounts involved slid by 11% to 62,559 with 22 countries involved. You can see it in picture 2. This is an accurate portrayal of an informal 90/10 rule of thumb with 90% of accounts being accessed/created within 10% of all countries listed. Only one out of ten accounts is created/accessed outside the regions drawn in picture 2.

Picture 2.

57% of the 62,559 or 35,361 user accounts were reported as being from the USA, the only country to score more than 10,000. The second biggest contributor of accounts is the UK with 9.2% (5739) originating from the country. Canadian users are third in the rank with 5.6% (3513). This is supported traffic-wise by FFN’s partnerships with large ad networks, which require at least 50% of site traffic to come from the USA, UK and Canada. For readers interested in accurate portrayal of accounts of non-US users, look below for picture 3. The percentage in picture 3 is displayed as a part of 27,198, which we get upon subtracting USA accounts.

Picture 3. FanFiction.Net Member Composition by Country, no USA

For your convenience, there is a list of countries, excluding the USA, ranked by how many accounts originate from the country in question.

1. UK

2. Canada

3. Australia

4. Philippines

5. France

6. Mexico

7. Indonesia

8. Brazil

9. India

10. Germany

SEX (GENDER)

Obtaining information about the account holder’s sex (gender) was more difficult than that of their country. Since the site does not encourage users to disclose such data, only those who make the explicit choice of doing so are included in this analysis.

Furthermore, users who did publicly reveal their sex on the profile did so in various ways and in different languages. While the former is a technical issue that can be alleviated with a specially-crafted regex, the latter is a serious obstruction. FFN Research did not have the resources to include gender specifications in languages other than English. Using online translation tools could have introduced uncontrollable accuracy faults. This, fortunately, was not necessary due to the country data already discussed.
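
To give an idea of what such a regex might look like, here is a minimal sketch. The word lists and the function name are assumptions chosen for illustration, not the patterns actually used, and a real profile would need far more cases. Profiles mentioning both or neither are treated as undisclosed.

```python
import re

# Hypothetical English keyword patterns; \b keeps "male" from matching inside "female".
FEMALE = re.compile(r"\b(female|girl|woman)\b", re.IGNORECASE)
MALE = re.compile(r"\b(male|boy|guy|man)\b", re.IGNORECASE)


def extract_sex(profile_text):
    """Return 'female', 'male', or None when nothing or both are mentioned."""
    has_female = bool(FEMALE.search(profile_text))
    has_male = bool(MALE.search(profile_text))
    if has_female == has_male:
        return None  # undisclosed or ambiguous
    return "female" if has_female else "male"


print(extract_sex("Gender: Female"))          # female
print(extract_sex("I'm a 16 year old guy"))   # male
print(extract_sex("Male or female? Guess!"))  # None
```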

Since users access their accounts from the USA (57%), UK (9.2%), Canada (5.6%) and Australia (4%), the cumulative majority (75.8%) of registered users is assumed to be English-speaking. Whether the language is a mother tongue or a foreign one does not matter in this research. The possibility of having non-English profiles (for example, Spanish) created by users from the aforementioned countries would make the figure of analysable content smaller. However, the effect should be compensated by profiles written in English by members hailing from other countries.

The result was 9544 user profiles with gender identity disclosed. For 2010, this means that 10% of FanFiction.Net members reveal their sex in the profile. This yields a 1.1% (1.3% for 50/50) margin of error at a 99% confidence level.

This data was initially broken into two uneven parts (5005 and 4539) to spot any structural differences between accounts created in the first and second half of 2010. The disclosure rate in the second half of the year was lower than that of older accounts, and the difference was statistically significant (2.4%). FFN Research offers a reasonable experience-driven explanation: there is a time lapse between creating an account on the site, writing a profile and putting one’s personal details on the profile. The explanation is supported by the fact that FanFiction.Net enforces time thresholds for when a new registered member may start using a particular service.

Gender distribution, on the other hand, did not differ significantly between the two parts of our sample. The female/male ratio was stable and stayed remarkably close to rumours that 80% of the site’s users are female.

The sample revealed that 78% of FanFiction.Net members are female, provided they joined in 2010. The remaining 22% self-identify as male. Picture 4 illustrates these. In addition, here is a gender ratio graph.

Picture 4. FanFiction.Net Members in 2010 by Sex (Gender)

MEMBER AGE

Age statistics on FanFiction.Net were the most challenging to attain. Fewer registered members disclosed their age publicly. 6410 people in our sample appeared to have included the precise information.

There were incidents of users reporting to be one year of age, and ninety-nine years old. A small part of those who discussed their age on the public profile included unreal ages, guessing challenges or offered an age range as wide as 20. Such data points are disqualified from the research.

2230 members with the account holder’s age present on the profile have only identified themselves as teenagers or teens. On the assumption that registered members define “teens” as ages 13-17, FFN Research distributed the 2230 proportionally among the relevant ages. For reference, you may view the age distribution prior to this choice, with a smaller sample. Note that ages beyond 55, all with single instances, are cut from the graph to make it more compact.
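
A proportional distribution of this kind could be sketched as follows. The per-age counts are placeholders, not the real figures, and the largest-remainder rounding is one assumed way to keep whole accounts that still add up to 2230.

```python
def distribute_proportionally(total, known_counts):
    """Split `total` across ages in proportion to `known_counts` (largest remainder)."""
    known_sum = sum(known_counts.values())
    exact = {age: total * c / known_sum for age, c in known_counts.items()}
    allocation = {age: int(share) for age, share in exact.items()}
    leftover = total - sum(allocation.values())
    # Hand the remaining accounts to the ages with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda a: exact[a] - allocation[a], reverse=True)
    for age in by_remainder[:leftover]:
        allocation[age] += 1
    return allocation


# Hypothetical counts of users who stated an exact age between 13 and 17.
stated = {13: 500, 14: 700, 15: 650, 16: 550, 17: 400}
extra = distribute_proportionally(2230, stated)
print(extra, sum(extra.values()))
```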


Picture 5. Age distribution on FanFiction.Net in 2010, post-processing with percentages

Looking at picture 5, we see that 80% of those who have revealed their age are between 13 and 17 years old. Assuming a large, normally behaving population, this allows us to extend the finding to everyone who registered as a member in 2010.

The average age is 15.8. The median age is 15, and the mode is 14 years of age. The graph’s shape is a good explanation for why these three values are different. The highest point (the mode) is not surrounded symmetrically, and both arms of the roughly parabolic shape have bumps at the sides, corresponding to ages 10 and 19. From a descriptive statistics perspective, these are anomalous and can be interpreted as missing data.
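
A toy example may help show why the three values drift apart when a distribution has a long tail to the right. The ages below are invented for illustration and are not the study’s data.

```python
from statistics import mean, median, mode

# Hypothetical ages: a cluster of teens plus a few older users.
ages = [13] * 3 + [14] * 5 + [15] * 4 + [16] * 3 + [17] * 2 + [19, 25, 35]

print(mode(ages), median(ages), mean(ages))  # 14 15.0 16.5
```

A handful of older users pulls the mean upward while the mode and the median stay with the teenage bulk, which is the same ordering (mode, then median, then mean) reported above.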

It is particularly acute on the left end of the spectrum, among younger people. No doubt, there are children below the minimum allowed age of 13 on FanFiction.Net. They make up a very small portion of the community, and seem to have an understanding that they should not make their age public. Eleven-year-olds appear to be the most knowledgeable in this respect. For twelve-year-olds, it is plausible they see the legal margin approaching, so there is no perceived harm in a premature disclosure. Had all the users disclosed their data, the point at 11 would have shown the opposite shape, not a dip.

No doubt, with all reputable sources repeating the notion and FanFiction.Net itself stipulating that various services should be “suitable for teens”, the site is less appealing to older users. It is, therefore, natural to expect a much lower registration rate among adults. The downward trend acts as expected, save for a few small waves further on the right. These, interestingly, have a period of 5 years minimum, with apexes at 35 and 45. From a purely human perspective, understanding that middle-aged people on FanFiction.Net have interests in fiction, fan or otherwise, FFN Research suggests a quote by Lady Bracknell from Oscar Wilde’s The Importance of Being Earnest: “35 is a very attractive age. London society is full of women of the very highest birth who have, of their own free choice, remained 35 for years.” Alongside greater privacy concerns, the increase at certain ages could be explained by one’s social policy.

With regard to age distribution among sexes, 2240 account holders have provided both pieces of information on one user profile. 79% of accounts with the data made available were female. This leads to the conclusion that disclosure is proportional to the size of a population. The site’s usage among teenagers and adults, considering either gender separately, was similar to the cumulative distribution. Ergo: no matter the age, teenage or adult, the male/female proportion stays the same as in the general population.

For pre-teens, the composition is 81.5% female, 18.5% male, which is the most extreme proportion for any age group, but there is no statistically significant deviation from the general 78%.

BETA READERS

In the sample of 95,313, there were 560 users with a beta reader profile, which makes up 0.6% of the sample. In a website with under 50,000 beta readers, this would translate into 2608 beta readers with an account created in 2010. Those joining in 2010 make up approximately 5% of all beta readers. The majority of beta readers have stayed on FanFiction.Net longer than a year.
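
The extrapolation above is a simple scaling from sample to population. In the sketch below, the denominator of 95,219 (the sample after removing non-country entries) is an assumption: it is the value that reproduces the quoted 2608, whereas 95,313 would give roughly 2605.

```python
def extrapolate(sample_hits, sample_size, population_size):
    """Scale a count found in the sample up to the whole population."""
    return sample_hits / sample_size * population_size


estimate = extrapolate(560, 95_219, 443_400)
share_of_all = estimate / 50_000  # against the upper bound of 50,000 beta readers

print(round(estimate), f"{share_of_all:.1%}")  # 2608 5.2%
```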

419 (75%) of beta readers joining in 2010 have revealed their sex. 345 were female, which accounts for 82% of those who made the disclosure public. There is no statistically significant difference between the gender distributions of beta readers in relation to the general population of those joining in 2010 at a 99% confidence level.
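
One way to check a claim like this is a one-sample z-test for a proportion. The post does not state which test was used, so this is only a sketch under the assumption that the general 78% female share is treated as a fixed reference value.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_z(successes, n, reference):
    """z statistic for an observed proportion against a fixed reference proportion."""
    observed = successes / n
    standard_error = sqrt(reference * (1 - reference) / n)
    return (observed - reference) / standard_error


z = one_sample_proportion_z(345, 419, 0.78)
critical = NormalDist().inv_cdf(0.995)  # two-sided cutoff at 99% confidence
significant = abs(z) > critical

print(f"z = {z:.2f}, cutoff = {critical:.2f}, significant: {significant}")
```

With these inputs z comes out a little above 2, below the 99% cutoff of about 2.58, which agrees with the conclusion that the difference is not significant at that level.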

79% of all beta readers are users aged 13-17. The modal beta reader age is 14 while the average and median ages are 15.

STORY COUNT

In our sample, there were 64,484 stories submitted (300,500 in the total population). The average number of stories submitted by a user with at least one story was 2.9, with 22,023 accounts containing stories. Interestingly, the maximum number of stories, 88, was found among those who joined in the second half of the year. At the time of writing, the number of stories written by the user in question has gone up to 94. The person wrote for 22 fandoms in 2010, and was most attracted to Everybody Loves Raymond, Hannah Montana and iCarly.

Among users with a disclosed gender, 78% of all stories were written by females. This shows that gender has no influence on the number of stories written on FanFiction.Net among users who joined in 2010.

Age, however, does make a difference to the average number of stories per age group. For teenagers, the average number of stories written, for accounts with at least one story, is 4. The highest number is among those aged 20-30, with an average of up to 12 stories.

END PART ONE

FFN Research welcomes your views and comments.

Thursday, 10 March 2011

Upcoming FFN Research Update

FFN Research is getting ready for the greatest update yet. For the first time in the history of FanFiction.Net you will see a full picture of fandom demographics with an error margin as low as 0.3% site-wide.

It has been a while since the first petitions to provide this information have reached FFN Research, but the data is present, and only post-processing remains.

Stay tuned!

Tuesday, 11 January 2011

FanFiction.Net Fandoms: Story and Traffic Statistics

FanFiction.Net Statistics include all fandoms in this analysis. The amount of data collected on January 1, 2011 is enormous, and we now have the ability to compare how each fandom registered within FanFiction.Net grew since our first release.

We start with the basics and site-wide descriptive statistics before entering top-level categories (Anime, Books, Games et cetera) and delving into individual fandoms. This research paper involves not only the biggest fandoms, but also more obscure, yet dynamic communities. Comparative charts and future growth predictions are presented to illustrate the trends. Off-site sources aid the study in audience profiling and global fandom trends. Due to the volume of analysis, it is presented in several consecutive posts to save scrolling space and browsing resources.

The goal of this release is to present you every category’s health check, trying to find the most resilient top-tier category of 2010.

Warning: Reading this text may take some time, so you can do it in parts. Everyone can post a comment. FFN Research has a new release in store, but it’s always nice to know what questions interest individual readers. The text was written to be simple enough for ages 13 and up, but if something confuses you or you find an error, please let us know.

BASIC INFORMATION

FanFiction.Net has 5879 fandoms (series/categories). These fandoms contain 3,744,842 stories.

The site houses 621 more fandoms than on July 15, 2010, or 1368 new fandoms since the end of 2009. 2010 created 23% of all fandoms you see now, and it was a 30% increase in total fandoms over the previous year.

An average fandom on FFN has 637 stories. A median fandom has 14 stories (69 fandoms have 14 stories), which is two stories fewer than six months ago. The mode fandom has 1 story (793 fandoms have only 1 story).
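
The percentages and the average above follow directly from the three totals quoted in this section. A short check, with variable names chosen for illustration:

```python
fandoms_now = 5879       # fandoms on FFN as of January 1, 2011
stories_now = 3_744_842  # stories in those fandoms
new_in_2010 = 1368       # fandoms added since the end of 2009

fandoms_before = fandoms_now - new_in_2010       # 4511 at the end of 2009
share_of_current = new_in_2010 / fandoms_now     # part of today's fandoms created in 2010
growth_over_2009 = new_in_2010 / fandoms_before  # increase relative to the 2009 total
mean_stories = stories_now / fandoms_now         # average stories per fandom

print(f"{share_of_current:.0%} {growth_over_2009:.0%} {mean_stories:.0f}")  # 23% 30% 637
```

The 23% and 30% differ only because of the base: the same 1368 fandoms are divided by the current total in one case and by the end-of-2009 total in the other.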

Below, you see how top categories fared in 2010, ranked by size. The biggest winners and losers are highlighted.

Anime/Manga has been the largest contributor in 2010, responsible for 27.6% of growth site-wide. The category gained almost 18% in new fanworks this year. Cartoons, along with TV, had more than a 25% increase in-category. Plays have shrunk by 12.7%, but given their minuscule weight on FFN overall, this amounted to only a -0.4% contribution to annual growth. Nonetheless, it is strange to see a whole top-level category lose weight throughout the year. No media categories shifted in rank since the beginning of 2010.

In total, FFN grew by 20% in 2010 and received over 627,000 new story uploads. The site’s account total rose by a similar value.

NEW FANDOMS

Numerous fandoms arrived to the site in 2010, affecting the top media category structure. Below, you see a table with the total number of fandoms in every category on two dates and two story count meters. These are explained as follows along with other columns that may raise questions:

One – the number of fandoms in a category that has exactly one story. It is included to point out how many series can be considered a failed venture that did not generate any attention.

Under ten – the number of fandoms in a category that has less than ten (1-9) stories. It is included to point out how many series can be considered a questionable venture. Communities are very fragile at conception, and any fandom that does not have sufficient backbone in story total may not sustain itself.

It’s possible to provide additional counters like Under 100 or under 1000 on demand.

% of new – the share of new fandoms the media category received, relative to the total of new fandoms.

% of category – the increase of fandoms as a percentage of the Jan 1, 2010 fandom total.

The biggest winner and loser are highlighted for you. This table reveals a lot about the health of a top category. While it is impossible to assess the sentiment in a particular series from the information above, the general moods that roamed in media categories throughout 2010 can be seen with ease. While Anime/Manga remains the FFN heavyweight in terms of story count, more than 200,000 stories ahead of its closest rival, Books, the latter is the champion of fandom counts. It is an interesting phenomenon that Books, having more fandoms, has fewer stories than Anime/Manga.

Before we explain it, let’s have a look at the new fandom loser of 2010, Misc. In 2010, Misc had a marginal number of fandoms, but five times more stories than its closest rival, Plays. Misc also had the lowest number of new fandoms, but all of them grew to have more than 10 stories. In the researcher’s opinion, Misc, like Plays, should be considered anomalous. However, it’s difficult to put them aside as a separate category because they depict extreme trends that occur with extreme values.

Looking at the fandom total, Misc has only 35 fandoms as of January 1, 2010. Plays have almost three times as many, but fewer than 100. Comics are their closest companion, and Cartoons follow. Games are the middle child of Fan Fiction with a jagged transition into the top categories: Anime, Book, Movie and TV. Despite these being easy to categorise by the number of fandoms, there are two more perspectives visible in the table.

TV is the top dog of new fandoms. 365 appeared in 2010, and Books are closing the gap at 332. When it comes to attempts at discovering a new driving force, TV and Books take the cake. Games and Anime sit in the middle, leaving the rest far behind. Now, lots of new fandoms are not necessarily a good thing because some of them can fail. TV has a lot of new fandoms, but also a lot of failing series. In fact, the number of fandoms with one story has doubled in TV, relative to the category size. In absolute terms, it is clear as day that TV had 49 one-fic fandoms at the beginning of the year and 155 at the end. Talk about a crippling failure rating. The numbers can be even more frightening when you consider the possibilities of past movements in fandoms. A rise from, say, 20 to 49 is not as precipitous as what we see now.

6 out of 9 categories have more questionable fandoms in 2011 than they did in 2010. 755 Book communities have fewer than ten stories. That is more than half of the total number of fandoms in Book. The situation is similar in other categories large by story count like Anime, TV and Movies, but not Games. Once again, Games position themselves in the middle.

By this point, it might get difficult to put all the numbers in one system, so a reference point is necessary. In 2010, the number of questionable fandoms (under 10 stories) rose by 40%, while the total number of fandoms rose by 30%. At the beginning of 2011, 2600 out of 5879 fandoms had under ten stories. Ergo, the site’s questionability rating is 44%. If 44% of fandoms are in questionable condition, 56% are not, and it might explain why the site is eager to accept more fandoms. Statistics show that a new request is bound to be more successful than not.

The trend may overturn soon, though. Questionable fandoms are taking up more server space as time passes. Since their number is rising quicker than the total number of fandoms, the series are spreading themselves thinly. How thinly? At the end of 2009, the questionability rating (fandoms under ten stories divided by total fandoms) was 41.2%. At the end of 2010, this value was 44.2%. This means that the likelihood of a fandom growing has diminished. By a small margin, but an important one. If FFN is a litmus test of fan fiction trends in the world, the questionability rating is a litmus test of series (books, TV shows, games) gaining creative support.
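
The two ratings can be cross-checked against the other numbers in this post. The end-of-2009 count of questionable fandoms is not stated anywhere, so the sketch derives it from the rounded 40% rise; treat that figure as an approximation rather than a measured value.

```python
def questionability(under_ten, total):
    """Share of fandoms with fewer than ten stories."""
    return under_ten / total


end_2010 = questionability(2600, 5879)

# Derived, not quoted: undo the ~40% rise and the 1368 fandoms added in 2010.
under_ten_2009 = round(2600 / 1.4)  # about 1857 questionable fandoms
total_2009 = 5879 - 1368            # 4511 fandoms at the end of 2009
end_2009 = questionability(under_ten_2009, total_2009)

print(f"{end_2009:.1%} -> {end_2010:.1%}")  # 41.2% -> 44.2%
```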

Further illustrating the point, let’s have a look at Chart 1 with failure and questionability rating changes. This is important: the bars represent changes since the beginning of 2010, not the ratings themselves.

Categories in the chart are ranked from left to right by total story count from biggest to smallest.

Comics experienced the highest increase in failure rating, which means new fandoms in Comics were more prone to fail than in any other top category. But don’t let the percentage increase (45%) fool you, because we’re dealing with fandom numbers 8 and 19, not hundreds. This is where a small top category with 30k stories in total may skew perception.

Large categories should provide a more accurate display of sentiments in fandom. Failure ratings in them are increasing by more than 10%, while the increase in questionability is above 50%, with Games dropping out. If you were to draw a line from the tip of one bar (blue or yellow) to another, you’d notice a trend of sorts (a tilde or squiggle), with Games, once again, dropping out of context.

Trend or no trend, categories that have a lot to offer in terms of variety and story count see an increase in questionability and failure. This increase weighs a lot more than any decrease in smaller categories, with only Games acting as a dampening agent. Misc and Plays did not have a dramatic increase in failure ratings partially because they lacked a sufficient number of fandoms in 2010, i.e., did not provide enough data for a feasible conclusion in terms of dynamics.

But there still is the general outlook. Here’s a list of questionability ratings as a percentage of fandoms in the category as of January 1, 2011:
Anime – 39%; Book – 58%; Cartoons – 27%; Comics – 35%; Games – 39%; Movie – 49%; Plays – 52%; TV – 39%. Misc have 0.

These expose a fact which may not be up to date. Saying that, in general, 58% of fandoms in Books have not grown into double digits does not mean this applies to 58% of fandoms created in 2010. In some cases, the share is higher than 58% because the questionability ratings have, on average, increased. To turn the “some” into exact values, though, we need to find out exactly which series contributed to overall growth.

2010 has been a productive year for several new fandoms. Names like Inception or Sorcerer’s Apprentice, having come in the second half of 2010, should not surprise anyone. Since one of these happened to come to the top of 2010 fandoms, the table below reflects the last six months of the year for context.

As you can see, TV shows are dominating the table in fandom numbers (9), movies coming second (7), with two books and two cartoons filling the remaining spots. Interestingly, a Movie, not a TV show, got first place. The first two fandoms, Inception and Sherlock, leave any competition far behind. Inception appeared on FanFiction.Net on July 14th, and Sherlock on July 29th. Making a recount based on day count, Inception has a marginal lead (0.2 of a story). If the top two create a clear distinction on the list, the next nine fandoms form another group: fewer than 1000 but more than 100 stories, which welcomes one Book, Heroes of Olympus. The third group, with fewer than 100 stories, ends with two Cartoons, and there are no Games in sight, even though Games have stayed in the middle of lists so far. Anime also failed to make the cut.

For categories that did not make it in the top twenty, here is a short list with the top fandom and a list they would start being present in:
Anime – Togainu no Chi (41) – Top 30
Comics – Teenage Mutant Ninja Turtles (28) – Top 40
Games – Minecraft (14) – Top 50

Neither Plays nor Misc make it to a top ten list, pushed beyond the first hundred. But what about averages, you might ask? Surely, there might be some fandoms on the top, but, among several hundred fandoms, there might be a concentration issue, where one category takes a row of spots somewhere in the middle of the rank while everyone else is skewed. Such an observation sounds reasonable, so some descriptive statistics are in order. For obvious reasons, you won’t see the mode. If they’re not obvious, guess what the most common new fandom story number is. One. The number is two for Plays, but that category has shown odd results in other parts of the analysis, so it shouldn’t surprise. The median makes sense only for Movies and TV since their top fandoms shift the average a lot, but we dodge this by removing them from the analysis (in parentheses).

Anime – 4.2
Books – 3.8
Cartoons – 13.2
Comics – 4.4
Games – 3.5
Misc - …
Movies – 16.6 (8.3)
Plays – 1.5
TV – 23.9 (14.3)
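
To see why one runaway fandom distorts a category average, here is a minimal Python sketch. The story counts in it are made up for illustration and are not taken from the FFN spreadsheet.

```python
from statistics import mean, median

# Hypothetical story counts for new fandoms in one category (not real FFN data).
# One runaway hit sits among many tiny fandoms.
story_counts = [1, 1, 1, 2, 2, 3, 4, 6, 10, 1500]

with_top = mean(story_counts)
without_top = mean(sorted(story_counts)[:-1])  # drop the single largest fandom
middle = median(story_counts)

print(f"mean with top fandom:    {with_top:.1f}")
print(f"mean without top fandom: {without_top:.1f}")
print(f"median:                  {middle:.1f}")
```

With these invented numbers the mean falls from 153 to about 3.3 once the top fandom is removed, while the median stays at 2.5 either way.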

In total, new fandoms have generated under 14,000 stories in 2010.

That concludes the part dedicated to new fandoms.

TOP FANDOMS

With new arrivals on FanFiction.Net analysed, a part of the audience may have gotten anxious about things more down to earth: the big players on FFN. Below, you have a top twenty table at two dates, the beginning of 2010 and of 2011, along with changes in rank, showing which fandoms got more popular in 2010 compared to 2009. If the rank changes, the fandom is highlighted. When the number next to the fandom’s name is negative, the fandom has moved up (a move from 5th to 2nd is three places higher in the rank). Do not be alarmed that the sum of “+” does not equal the sum of “–”, as some fandoms that appear on the list were not on it before.

The top four does not change throughout 2010. With hefty gaps greater than 50,000 stories, that is easy to explain. Commotion occurs in the middle of the list, with certain fandoms jumping over several others at once. The quickest jumper on the list is Pokemon, which overtook five fandoms. Dragon Ball Z, on the other hand, is the biggest loser, dropping five places. While Pokemon is a living franchise that may produce any number of games, movies and anime, Dragon Ball Z has a negative outlook: its abrupt drop down the leaderboard looks even worse with no new content being released.

Another shift worth inspecting is Supernatural vs Buffy the Vampire Slayer. As of January 2011, Supernatural, not Buffy, holds the #1 spot in TV shows. The latter has been a long-standing leader in that category due to little activity in other fandoms. In fact, the vampire series has been on FFN since 1998, seven years longer than Supernatural. However, Buffy’s future is stable because there isn’t any TV show able to take its place in the near future.

Kingdom Hearts and Yu-Gi-Oh had an odd change in pacing, with the former gaining almost 10,000 stories and the latter only 4,000. While dedicated fans will know more about activity in Kingdom Hearts being driven by new content, the outside perspective is that Kingdom Hearts’ forums were decimated on FFN on November 25, 2010, with the flagship forum losing 8 out of every 10 of its more than 500,000 posts. Apparently, forums and story content do not correlate well in that fandom.

The end of the list has two newcomers responsible for pushing out CSI, another long-timer present since 2001: Avatar: The Last Airbender and Death Note took its place. Both have a positive outlook, considering Sailor Moon and Dragon Ball Z are within range. In fact, these two, especially Avatar, are a threat to Teen Titans, which somehow managed to keep its spot as #18. On January 1, 2011, fewer than 400 stories separated it from Avatar. A surge of activity in Avatar is expected in the fourth quarter of 2011, when an addition to the series is scheduled.

It is improbable that fandoms that emerged in 2010 are going to appear on the list in 2011. The main candidate, Inception, is a movie and not likely to gain enough momentum to overtake even Death Note, which has over 26,000 stories compared to fewer than 2,000 for Inception.

Likewise, the possibility for an unlisted fandom to appear on the top 20 list is slim. Fandoms that used to be in the top 20, but were pushed down throughout the years, had few truly large competitors. In any case, they would have to overcome outsiders like CSI and Star Wars.

In total, top fandoms generated 230k new stories in 2010. As of January 1, 2011, top fandoms listed a sum of 1.64 million stories, almost half of all stories currently present on FanFiction.Net. The top 20 list also contains 25% of stories ever posted on the site. New fandoms brought 0.7% of this value in 2010.

INACTIVE FANDOMS

You have encountered tiny fandoms that were not likely to gain any new stories due to their size. We referred to them as “failed” (1 story) and “questionable” (less than 10 stories), but these were a projection into the future. There are fandoms on FanFiction.Net that have not gotten a single new story in the second half of 2010 (or have gotten some stories, but administrative or other deletions brought the number back to the level of Jan 1, 2010, creating a zero sum [the period is comparable with that used for new fandoms]).

1814 fandoms did not receive a single new story over the last six months. This is more than the total number of new fandoms, 1386. Those 1814 fandoms contain approximately 30,291 stories.

Results are negative for all but one top-tier category. There were more fandoms being idle in every category than becoming active in 2010. In case of Misc, its inability to receive new fandoms throughout 2010 compensates the small ratio, with Anime X-overs being the only idle fandom. The situation may worsen for all Misc fandoms based on X-overs because they were created before the site established non-section crossovers, so they could be placed in the relevant fandom instead of Misc. It causes duplication of resources, but it is not as stunning as a category someone hacked on FFN (spare image).

The list you see below could have had greater inactivity values, specifically, for Plays, but a lower story count provides a completely different situation…decaying, perhaps. When the biggest series in the category (RENT) gets barely 30 extra stories in a year despite 400 being posted, conclusions get colourful.

On a lighter note, TV offers promising activity. The difference between new fandoms and inactive ones is practically non-existent, while other top categories display nearly identical ratios close to the site’s average. It applies in Books, Cartoons, Movies (and Plays). Games, along with Anime appear in a separate group with a high idleness rate. This correlates with one reader’s opinion that the site made a mistake by trusting Anime fandoms to generate its volume in the past year. Indeed, it has the most bulk, but uses a try and try again notion that had below average results in 2010.

CONCLUSION

Let’s recap and make a graphic comparison of all top tier categories. In practice, the table below is a ranked connection of the tables used above with a summary rating in the final column. The lower the value, the better fandoms in that category fared compared to others.

Given the criteria you see, TV was the healthiest top category of all in 2010, and if you want a sustainable experience in fandom, choose TV. Books is a healthy alternative, followed by Anime, Movies and Games. In fact, Games gives you the most average experience on the entire site. It’s not exceptional by any criterion, but it steers clear of any negative ratings and risks.

Cartoons and Misc may surprise you in some ways, but don’t expect much activity or exceptional review counts if you post a story there. And if you’re a really hardcore fan, 2010 offered writership like no other in Comics and Plays. When your life makes too much sense, write in the Plays category. Who knows what awaits you in a category that’s shrinking.

DATA RECAP

FanFiction.Net has 5879 fandoms, 3,744,842 stories.
The average fandom has 647 stories.
The median fandom has 14 stories.
1368 fandoms were created in 2010.
1814 fandoms did not grow in 2010.
2600 fandoms have under 10 stories.
793 fandoms have one story.
One top category shrank.
One fandom was hacked.

Largest fandom – Harry Potter.
Largest new fandom – Inception.

Fandom lists will be available shortly.

End Notes

This is only a part of the data cache collected for FFN Research, but getting it together into readable shape does take a while. Please show your support for this research blog, so it doesn’t die like some of the fandoms described above.

Saturday, 25 December 2010

Research 101

We've had requests to provide the public some tips on how to conduct research of their own. Glad to oblige! Employing the easy process below, you'll be able to model various trends online in fifteen minutes or less.

First things first, though. It's not uncommon to hear "people say" or "most members think that" while these phrases are, in reality, pulled out of thin air, mere assumptions. Sure, we could all trust our intuition, but it would get messy, especially when you have two equally presumptuous opinions on the table. Solution? Get to the facts, the objective stuff.

How do you do that? You find the numbers. Why numbers? They are easy to analyse and difficult to misunderstand. When you add 2+4+8 together, you can find their average, maximum value, their order in the sequence, the total, and even come to a conclusion of what number should come next. You can't do that with "dressing" - "snowman" - "goose". When you see 2, it is likely others will see it as a 2, causing fewer misunderstandings. On the other hand, "dressing" could be understood as getting clothes on or salad dressing. Being on the same page is what counts here.

Also, you may want to have a LOT of numbers. Why? Let's say you're standing in front of a zoo, asking people about the total number of exhibits. You ask a blind person and a small kid, two people. The first person tells you 3, the second - a gazillion. That can't be right. Solution? Ask more people. The theory is very simple: the more numbers you have on your list, the more likely it is that you'll get to the real deal. Sure, the most "reliable" way would be going to the zoo and counting yourself, but it's not that easy if the zoo closes in fifteen minutes and you don't want to spend $20 on a ticket. And if someone asked you about the number, you'll have no way to prove your count is correct. But if you have a list of people vouching for a certain number, majority rules, and it's likely the problem is solved.

Notice that you've been given one example, and some of you may not be convinced. Had there been twenty or fifty examples to justify the need to get many numbers, it would be very likely that everyone would be convinced because different examples would appeal to different people. The beauty of this is that it is only a hypothesis: while it sounds rational, it might not be true. What if all the examples are equally dumb? That is likely when only three people work on them. What if these examples contradict one another? See, there is a lot of uncertainty without proper calculations, and if you want to get to the bottom of something, you need to get your hands dirty.

Think about the topic. Do you want to study a trend? Do you want to find what influences something else? Research can tell you which day to post a story to get the most reviews, what are the prospects for your fandom and all other things you've seen in the previous posts. Amazingly, trying to solve one problem usually solves several because you can reuse and adapt your data to show a whole system at work.

TOPIC

We're working in a real environment here, so let's find things you might care about, reviews. Our assumption is that you want reviews and are interested in finding out how to get more reviews. Whether that assumption is true is none of our concern because the purpose of this post is to show you how to conduct research.

Let's make a bet that something influences reviews and they are not written at random. To make things specific, we'll choose one fandom (Sonic the Hedgehog) and one language (English) and a date. Why one fandom? Because that's where a story would be located, and every fandom has different review patterns. When you are ready to post in Sonic the Hedgehog, for instance, you might find it more useful to know the outcome of posting in Sonic the Hedgehog rather than Tetris. More reasons are described in METHOD. As such, our topic would be: "Factors influencing the review count in FanFiction.Net's English Sonic the Hedgehog section in November 2010". Why November? December is not over yet, and November is the latest full month. Stories updated in November must have gotten all the reviews you can count on, both from browsing readers and favorites/alerts.

It's necessary to have a topic written out clearly for yourself and for any person who may want to read your findings. For one, the topic won't let you sidetrack, so you reach the goal you set. For another, people will know what to expect from the whole research upon reading the topic's name. Naming your topic too broadly or incorrectly will make you answer questions you didn't ask. If you think it's no big deal to have the topic written wrong, the world of research will be very cruel to you. For instance, if you're looking for review trends of 2009, you may waste time if someone didn't write the year of their research's interest right at the top. Normally, if you don't write the date or the fandom, it is assumed you're doing a general or site-wide search, which is too difficult for this tiny example. When the example is done, though, you will easily be able to make it more up to date and applicable to more fandoms.

Our topic: "Factors influencing the review count in FanFiction.Net's English Sonic the Hedgehog section in November 2010".

VARIABLES

What influences a review count (in Sonic the Hedgehog)? The number of chapters, perhaps. The more chapters, the more reviews, we assume, because one person can only review once, and two chapters can mean two reviews from one reader. This might not be true, but our research will be able to answer that, too!

What else? Word count. Stories with fewer words have fewer reviews, we assume.

Experience? The more experienced the author, the more reviews his or her stories should get. We don't have an experienceometer, though, so it has to be something else. Author's age? We could ask authors for their age, but they might lie, be unavailable or make us wait too long. Hmm, it seems deciding what variables to use is greatly limited by the ability to obtain the data. Hey, account age might work as a measure of experience. The longer the account has been on FFN, the more reviews it should get, we assume. It's possible to get the information from the account ID. The higher (newer) the ID, the fewer reviews an author gets.

Let's add a fourth variable, the number of stories posted on the author's account. The more stories you have now, the more reviews you're going to get, we assume.

We could have added a fictional "yes"/"no" AKA "Boolean" variable. Boolean variables are very useful to turn obscure qualities into numbers. For instance, writer's nationality is boolean when checked by a question like "is the writer American?" In it, "yes" ("American") would be 1 and "no" ("Other country") would be 0. When the variables you have picked logically don't work, add something boolean to set them apart. They can be anything from "acts like a jerk" to "has the word 'honey' on the profile". Just don't let them go dominant in your research.

We're doing a quick piece of research here, so four variables will be enough. You may not want to use too many variables in your research because it usually brings certain problems.

Rule of thumb - every variable requires 6 data points. We have 4 variables, so that's a minimum of 4 x 6 = 24 stories in the sample.

METHOD

Now that we've decided the variable we're trying to analyse (review count) and have deciding factors (chapter count, word count, author ID, number of posted stories), it's time to select a method for making the future steps.

Obviously, we're going to gather data based on observation, not a questionnaire. Surveys fail too often, and we don't have to bother anyone by taking notes of what we can see publicly.

We may want to use our research results more times than one, making them practical.

It brings both problems and opportunities. The more you want to predict and the more accurately you want to do it, the higher the requirements are and the fewer choices you can make.

Let's look at some of the requirements we want to fulfil. We want to apply the results of our study to a general audience. By that, it is implied we're using sampling. It's very time consuming to go through all stories on FanFiction.Net (over three million), so a sample, a part of the whole, will do. This part should have the same qualities as the whole.

A visual alternative: you have to draw a specific triangle, but you don't know its angles, only the perimeter (length of the line used to draw it all). You have very little chalk, so that'll have to be a proportionally smaller triangle. One inch of your triangle could mean a hundred inches of the life-size triangle, the qualities (angles/edge length) of which you're trying to determine. This proportion has to be kept for every edge.

The problem is that you know neither the size of any of the angles nor the length of any individual edge, only the perimeter.

Don't worry. There's a magic trick called "randomness". It's difficult to explain, but if you let randomness take the pick for you, you get the most accurate results. This has to do with bias. We're unconsciously biased towards certain numbers, and we can't let that get in our way. Our opinion could only go as far as assuming the logical factors to influence the variable. That's why dice and coins are used as tools of chance.

When making the topic, we saved ourselves a lot of logistics by defining one fandom, and one month, one language. This is our "large" triangle. On FFN it spans from page 23 till 44 of this fandom. For safety, let's reduce our page count to 24 - 43 because the pages may shift while this guide is being written. Every choice has to be justified.

Now, we see a beautifully placed list of stories. Every page has 25 stories. There are 20 pages, so 20 x 25 = 500 stories. Our large triangle's perimeter is 500. Now, we decide the proportion.

This is a crucial moment: pick enough stories to be accurate while not straining yourself with repetition. Another rule of thumb is to have at least 100 data points. If you examine 100 stories, you don't have to prove certain things and can give the data the default treatment. However, you may want to do things more precisely. Commercially popular sample sizes are 250, 500 and 1067-1100. The sample size determines the error band. When you have 1100 people/stories examined, the error margin (the half-width of the confidence interval) tends to be 3%. This margin determines what counts as a statistical difference. In statistics, 43% and 44% are not necessarily different (don't have a statistically significant difference) because the gap might be down to your error margin. It's possible to determine the interval manually, but I like using this website to do it for me.

Even if you have a small confidence interval (3% is small), there is a chance some freak statistical accident happened and your research doesn't mean squat. How sure you can be that it didn't is called the confidence level. In commercial data collection, it ranges from 90% to 99% because you cannot be 100% sure of anything when sampling. The higher the confidence level, the more data points you need. The higher the error margin/interval, the fewer you need. These two are set independently. You may have a 99% certainty level that the average review count is 10 ±8% or a 50% certainty level that the average review count is 5 ±1%. Confidence levels are, generally, more important. Don't aim for lower than 93% because when you conduct 100 samplings, 7 of them would make no sense at all, and you don't want to be a part of those 7. The max error margin we can tolerate is 10%. There are other factors that matter, but we're aiming at a quickie.

We're going to pick 100 stories, dodging some evidence gathering, at a 95% confidence level, which would give us an 8.8% error margin.
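
The 8.8% figure can be reproduced with the textbook margin-of-error formula for a proportion, corrected for a finite population. The post does not say which formula the calculator uses, so treat the sketch below as an assumption: it takes z = 1.96 for a 95% confidence level and the worst-case proportion of 0.5. The function name is made up for this example.

```python
import math

def margin_of_error(sample_size, population_size, z=1.96, proportion=0.5):
    """Margin of error for a proportion, with finite population correction.

    z=1.96 corresponds to a 95% confidence level; proportion=0.5 is the
    worst case (it gives the widest margin).
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    # The correction shrinks the margin when the sample is a large share
    # of the population (here 100 out of 500 stories).
    correction = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * correction

moe = margin_of_error(sample_size=100, population_size=500)
print(f"{moe:.1%}")  # 8.8%
```

Calling the same function with 1100 stories out of a population of three million gives roughly 3%, which matches the commercial sample size mentioned earlier.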

All right, we have the proportion now: 100/500 = 1/5 = 20%. There are two ways we can go now. The easy way in our case is to do systematic sampling, which is, basically, taking every fifth story you see and noting its data. The randomness here comes from having chance decide which story to start from. Since I don't have a five-sided die, I'm going to take five bits of paper, number them from 1 to 5 and let someone pick one of these. The number that is pulled out decides which story, counting from the beginning of my sample on page 24, is going to be first; I then take every fifth story after that. Let's assume I did that, and the pulled out number was 1.
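
For readers who would rather script it, systematic sampling can be sketched in a few lines of Python. The positions 1 to 500 stand for the 500 stories on our 20 pages, and a random number plays the part of the five bits of paper. The function name is made up for this example.

```python
import random

POPULATION = 500  # 20 pages x 25 stories
STEP = 5          # 100 out of 500 stories means every fifth story

def systematic_sample(first, last, step, start_offset):
    """Return every `step`-th position, starting `start_offset` items in.

    start_offset is 1-based: 1 means the very first position.
    """
    if not 1 <= start_offset <= step:
        raise ValueError("start_offset must be between 1 and step")
    return list(range(first + start_offset - 1, last + 1, step))

# Chance picks the starting point, standing in for the five bits of paper.
offset = random.randint(1, STEP)
sample = systematic_sample(1, POPULATION, STEP, offset)
print(offset, len(sample), sample[:3])
```

Whichever starting number comes up, the sample holds exactly 100 positions.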

The second approach is more difficult in this case, but applicable to more things. Sometimes, it's impossible or unnecessary to know the proportion. Judging by story IDs, FFN "should" have over six million stories, but it has only three million. If we hadn't done research before, we wouldn't have known this, and you would think this would lead to bad samples. Not at all. Sometimes it's impossible to make a list or to know where something begins and ends. This is where randomness replaces the list without changing any confidence-related issues. Had there been a different number of fanfics on every page (instead of 25 on every one), we would have used Excel's random number generation and asked it to generate 100 story IDs from 1 to, say, 6 million.
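
The same draw can be sketched outside Excel. The snippet below uses Python's standard `random` module; the upper bound of 6 million comes from the paragraph above, while the function name and the seed are arbitrary choices for this example.

```python
import random

def random_story_ids(count, highest_id, seed=None):
    """Draw `count` distinct story IDs between 1 and `highest_id` inclusive."""
    rng = random.Random(seed)
    # random.sample never repeats a value, so no duplicates need removing.
    return sorted(rng.sample(range(1, highest_id + 1), count))

ids = random_story_ids(count=100, highest_id=6_000_000, seed=2010)
print(ids[:5])
```

Since only about half of the possible IDs point to a story that still exists, one plausible way to handle a dead ID (an assumption, not something the post prescribes) is to skip it and draw a replacement.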

Looks like we're all set for practice.

ANALYSIS IN EXCEL

We're going to work with Excel's 2003 version. First, let's make sure we have what we need. In the top menu, click "Tools" and see if you have "Data Analysis" in the drop-down menu.

If not, go to View-Toolbars-Customise. Click "Commands", find "Data" and see if you have "Data Analysis" to choose. If you do, just make it visible.

If you don't, we'll need to go to Tools-Add-Ins. Check "Analysis ToolPak - VBA" and/or any other version of the phrase you may have. Press "OK", restart Excel and see if you have "Data Analysis" in "Tools" now. If you don't, your version is either incomplete (use the installation disk, and install Add-Ins for Excel) or you are on a restricted computer.

Moving on. Click Tools-Data Analysis. "Random Number Generation" should be highlighted by default. Excel is clever and knows people need the random numbers first. Click it, and you'll see a new window. Open the Distribution drop-down (in the middle) and pick "Patterned". Now, you're in total control. The number I see on top of page 24 is 576. The number on the bottom of page 43 is 1075. The important part is our proportion, 20%, which means every fifth story goes in, and we needed 5 bits of paper to decide whether we start at 576, 577, 578 et cetera. Our paper said 576, so that'll be the first number we write in ("From:").

Fill the window with the following.
Number of variables: 1
Number of random numbers: 100
From: 576 to 1075 in steps of 5
Repeating each number: 1 times
Repeating each sequence: 1 times

Delete the last number if you get 101 results. If you're picky, try "Uniform" in the drop-down of a new "Random Number Generation" window. You just input the range with the rest being identical 1 and 100. Uniform is, generally, a better solution because it requires less input from you at first, and you may have to merely round up the numbers and add 1 if you get two identical numbers after rounding up.

Now, we have the story numbers from the pages we need. We should note the data each story has. If you don't have the time to list the variables, just save links to the 100 stories to check them later. (Instead of making five clicks per story, you'd make only one.) Even if you have the time, save the link or story ID you get upon clicking the stories because page numbers change (that's why we moved from 23 to 24, and the pages did shift since the beginning of this paper). Of course, this is a matter of choice and seriousness.

You have 100 numbers now. I suggest selecting the 100 numbers (not the whole column), pressing CTRL+X, and putting the cursor on cell A2. Press CTRL+V. You should see them now placed one cell lower.

We should do some labelling (that's why we lowered the rank numbers). Write 'y' in cell B1, 'x1 - chapters' in C1, 'x2 - words' in D1, 'x3 - author ID' in E1 and 'x4 - story count' in F1. Add extra labels to make the columns more informative if needed. Consider freezing the first two rows in your sheet by selecting them and going to Window-Freeze panes, so the labels wouldn't get lost. You may notice that 100 stories on the list is more than 24 required by our "times six" rule of thumb. That's good.

What we have to do now is start taking notes. They should look something like this when you finish. Scroll down.

y - x1 - x2 - x3 - x4
9 - 1 - 870 - 1720168 - 3
0 - 1 - 2091 - 2332564 - 35
11 - 4 - 8977 - 1146820 - 4
13 - 2 - 1696 - 2497515 - 8
32 - 12 - 54094 - 370579 - 15
0 - 12 - 9070 - 2600296 - 3
2 - 3 - 2346 - 2625209 - 4
10 - 1 - 942 - 2322399 - 10
0 - 1 - 812 - 1445016 - 27
48 - 14 - 10430 - 2234950 - 2
1 - 1 - 756 - 1247257 - 4
4 - 2 - 6918 - 2254848 - 9
1 - 1 - 828 - 2500706 - 8
11 - 4 - 2851 - 2466270 - 4
51 - 5 - 27418 - 2464934 - 5
23 - 7 - 42338 - 2246255 - 17
32 - 7 - 9075 - 2349427 - 51
0 - 5 - 5611 - 2592567 - 2
23 - 11 - 3932 - 1960339 - 7
6 - 1 - 1326 - 2432493 - 22
2 - 3 - 7584 - 2254848 - 9
62 - 7 - 30210 - 2407962 - 8
36 - 19 - 29653 - 2469814 - 17
19 - 23 - 53349 - 1733388 - 23
4 - 1 - 3013 - 1802183 - 14
1 - 1 - 1480 - 2405648 - 24
3 - 2 - 1646 - 2001585 - 16
11 - 3 - 2300 - 2443927 - 4
2 - 1 - 1772 - 2619494 - 5
1 - 5 - 4977 - 2592556 - 1
16 - 10 - 16627 - 2416048 - 4
2 - 1 - 969 - 998811 - 5
7 - 2 - 3288 - 2621859 - 1
12 - 4 - 2387 - 2572568 - 1
1 - 1 - 1188 - 2576690 - 1
287 - 22 - 106903 - 1263516 - 24
21 - 7 - 9713 - 2229401 - 7
4 - 4 - 3526 - 1842866 - 8
25 - 9 - 6025 - 2418265 - 22
1 - 16 - 12514 - 2581451 - 4
2 - 8 - 5571 - 2324060 - 18
3 - 4 - 8165 - 1055075 - 7
0 - 1 - 641 - 2533529 - 1
0 - 1 - 637 - 2564950 - 12
0 - 1 - 411 - 2434313 - 5
145 - 12 - 51373 - 557082 - 42
1 - 2 - 816 - 2615675 - 1
1 - 1 - 1333 - 2397687 - 3
6 - 1 - 286 - 2363663 - 3
0 - 1 - 653 - 1890945 - 5
0 - 1 - 773 - 2434313 - 5
3 - 3 - 1378 - 2208560 - 9
64 - 23 - 83711 - 909079 - 12
2 - 2 - 2241 - 2603413 - 1
3 - 1 - 4643 - 1314061 - 14
1 - 6 - 807 - 2514303 - 5
0 - 1 - 450 - 2082789 - 3
9 - 1 - 691 - 2497515 - 8
1 - 1 - 201 - 2562978 - 1
6 - 2 - 3249 - 2315797 - 4
165 - 46 - 197328 - 1102393 - 1
3 - 13 - 40438 - 1598320 - 2
25 - 15 - 15107 - 2143219 - 2
33 - 10 - 13654 - 2141369 - 9
20 - 6 - 32769 - 120594 - 3
6 - 9 - 28965 - 1495936 - 22
4 - 1 - 1520 - 2349427 - 51
28 - 7 - 24952 - 2164733 - 8
10 - 8 - 10808 - 2048230 - 2
279 - 30 - 65882 - 1894188 - 10
5 - 1 - 1249 - 2603600 - 5
22 - 14 - 25407 - 2088418 - 1
0 - 2 - 1430 - 2421071 - 1
3 - 1 - 1530 - 1938657 - 7
0 - 1 - 4661 - 2605547 - 2
122 - 21 - 52851 - 1098628 - 5
15 - 2 - 3880 - 2100751 - 19
5 - 3 - 3723 - 2467839 - 2
0 - 2 - 1010 - 2127913 - 4
24 - 15 - 34346 - 1070963 - 5
3 - 1 - 1226 - 2474307 - 10
5 - 1 - 1698 - 2371159 - 11
0 - 2 - 1931 - 2602634 - 2
3 - 1 - 786 - 1890867 - 43
12 - 5 - 6181 - 1685030 - 5
26 - 7 - 6204 - 2133339 - 9
2 - 1 - 963 - 2316070 - 3
54 - 13 - 40760 - 2164733 - 8
8 - 6 - 1502 - 2338442 - 22
0 - 1 - 603 - 2127913 - 4
86 - 15 - 47367 - 1947992 - 15
30 - 11 - 16908 - 2140302 - 4
3 - 3 - 2289 - 2230728 - 6
3 - 11 - 6345 - 2400672 - 1
2 - 1 - 610 - 2547041 - 9
1 - 1 - 1266 - 2404673 - 5
1 - 9 - 6051 - 2596418 - 2
14 - 4 - 4525 - 1543587 - 7
3 - 3 - 3958 - 1514770 - 62
0 - 1 - 494 - 2594903 - 2

Every sample should be publicly available, so others could check your results for validity. It's easy to say "I've done research, surveyed 100,000 people and found that 2 out of 9 are pet owners," but others won't always take your word for it. Samples are usually available on demand as links you can download, not as lists in the middle of your research. The reason you see them here is to save you data collection efforts. By the way, gathering the above took me 35 minutes. This is a slow outcome because I had to click not only to the next page, but also on pen names to find their ID numbers and story counts in another window. Mind you, if you share the burden with two people or don't have to open new windows, you may do it sooner than your media player switches tunes.

Intercorrelation

The "scary part" comes next. It's scary because it has a lot of symbols you probably won't understand and won't need. But first, we lighten up our model. The list of numbers above is the basis of our model, a simplified version of reality. By simplifying it, we may lose some accuracy, but that's okay, because we can always add variables, make the list longer and reach an impractical level of accuracy. You may feel the difference between no reviews and ten reviews, but not 1.04 and 1.06 reviews. Be practical.

By "lighten up" in the previous paragraph, I meant dodging derivatives. You see, we have a dependent variable, our y, the review count. This variable is influenced by what we call "independent variables" x1, x2, x3, x4. While we logically tried to decipher what would be a factor for the review count, we might have, accidentally or otherwise, added variables that depend on one another, are derivatives. Notice how we tried to pick a variable for experience, looking into alternatives. We didn't know for sure whether they were alternatives, but we deemed them so. Statistical analysis allows us to see whether we've included two or more similar alternatives in our model. A model should be efficient and practical, so it's unnecessary to have a variable, which doesn't add value.

We're going to take the BACKWARD procedure. To use it, we need to have all our variables in the table, which we do, and start picking out the variables that don't add value. First, let's do a correlation matrix. In Tools-Data Analysis, pick "Correlation" and select all the data whilst not forgetting to tick "Labels in First Row". Click OK, and you should get a triangle of numbers.

Name: - reviews - chapter c - word c - author ID - story c
reviews - 1 - - - -
chapter c - 0.715994603 - 1 - - -
word c - 0.75572468 - 0.876722508 - 1 - -
author ID - -0.362859097 - -0.37127767 - -0.497448704 - 1 -
story c - 0.14074523 - 0.003728427 - 0.058269805 - -0.208831173 - 1

This matrix/triangle tells us how closely one variable is aligned with another. The first column shows how attached the dependent variable, our review count, is to our other variables. In the first column, the higher the coefficient, the better: anything above 0.8 is so awesome you can draw a straight line and call it a day. If anything in the other columns (save for the diagonal of 1) is 0.8 or higher (or -0.8 or lower), things are bad. It means one independent variable depends on another; they're alternatives. As such, one of them will have to go. And yes, we have that problem. Right in the centre, where "chapter c" meets "word c", we have 0.88. It means the two rise and fall together almost in lockstep, and such repetition is redundant. One of them has to go.
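
If you are curious what Excel does behind the scenes, each cell of the triangle is a Pearson correlation coefficient. Here is a minimal Python sketch of the calculation. To keep it short it uses only the first ten rows of the sample and three of the columns, so its numbers will not match the full matrix above.

```python
import math

# First ten rows of the sample above: (reviews, chapters, words).
rows = [
    (9, 1, 870), (0, 1, 2091), (11, 4, 8977), (13, 2, 1696), (32, 12, 54094),
    (0, 12, 9070), (2, 3, 2346), (10, 1, 942), (0, 1, 812), (48, 14, 10430),
]
labels = ["reviews", "chapter c", "word c"]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Turn the rows into one list per variable.
columns = [list(col) for col in zip(*rows)]

# Print the lower triangle, the same shape Excel produces.
for i, name in enumerate(labels):
    cells = [f"{pearson(columns[i], columns[j]):.2f}" for j in range(i + 1)]
    print(f"{name:10s}", "  ".join(cells))
```

Feeding it all 100 rows and all five columns should reproduce the triangle Excel gives you.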

How do we decide which? We go to the first column to find which of the two variables, "chapter c" or "word c", has the smaller influence on our review count. It's 0.72 for chapter c vs 0.76 for word c. Therefore, the word count is more important to us than the chapter count, and the chapter count has got to go. What do we do now? We copy the chapter count column somewhere far, so it doesn't get lost, and delete it from our main table. My suggestion is to have two sheets with tables, one being the main one and the other your work horse, which you edit and mutilate according to what Data Analysis tells you. Arrange your columns comfortably if placement has shifted.

Okay, we got rid of one faulty variable, and there weren't any more interdependent variables. Had there been more than one point above 0.8 or below -0.8 in columns after the first one, we would have needed to remove another variable, the less important of that pair.
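
The elimination rule just described can be written down as a short Python sketch. The correlations are copied from the matrix in this post and rounded to three places. The function and variable names are made up for this example, and this covers only the pruning step, not a full backward regression procedure.

```python
# Correlations copied from the matrix in the post (rounded to three places).
with_reviews = {
    "chapter c": 0.716,
    "word c": 0.756,
    "author ID": -0.363,
    "story c": 0.141,
}
between_factors = {
    ("chapter c", "word c"): 0.877,
    ("chapter c", "author ID"): -0.371,
    ("chapter c", "story c"): 0.004,
    ("word c", "author ID"): -0.497,
    ("word c", "story c"): 0.058,
    ("author ID", "story c"): -0.209,
}
THRESHOLD = 0.8  # the cut-off used in the post

def variables_to_drop(with_target, between, threshold):
    """For each pair of factors correlated beyond the threshold, drop the
    one that is less correlated with the dependent variable."""
    dropped = set()
    for (first, second), r in between.items():
        if abs(r) < threshold or first in dropped or second in dropped:
            continue
        weaker = min(first, second, key=lambda name: abs(with_target[name]))
        dropped.add(weaker)
    return dropped

dropped = variables_to_drop(with_reviews, between_factors, THRESHOLD)
kept = [name for name in with_reviews if name not in dropped]
print("drop:", sorted(dropped))  # drop: ['chapter c']
print("keep:", kept)             # keep: ['word c', 'author ID', 'story c']
```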

Regression analysis

We have just one magic trick left to discover, regression. Explaining what it is in non-math language can be difficult, but it is like a healthy, working generalisation. For instance, you see car tyres as round, you draw them as round, and they are used as an example of roundness. However, if we take a microscope, we'll find the tyre is very uneven, full of dents and little furrows we don't really care about. Regression lets you get to what matters, the essence of a happening, so you are not distracted by something insignificant or scarcely irregular.

Tools-Data Analysis-Regression. Click. We get a very frightening table with lots of tick boxes and input ranges.

Input Y Range: click on the white space after the colon and select the y (review count) column, finishing your selection by the last filled cell. Don't add empty cells, and don't add more than one column to your selection.

Input X Range: select the remaining three columns from the top to the last filled cells. It should be a rectangle with 3 columns and 101 (100 numbers + labels) rows.

Below you see three checkboxes. Tick "Labels" because we have included them this time. You don't have to include labels; Excel will give your variables generic names, but we want clarity here.

Tick "Confidence Level", and set it for 95%. It should be also the default number.

Never ever tick "Constant is Zero" or our car tyre model may turn into a square. If you're curious, ticking that would kill one number responsible for evening things out.

Don't touch anything else in the window, and just click "OK". You have a new sheet. Rename it to "regression" if you want. On top of the spreadsheet, you have SUMMARY OUTPUT and three weird tables, each with more columns than the previous. We'll be working only with the third one, but the other two are useful, too.

The top one tells you, basically, one thing. You may have seen "R2" or "R squared" mentioned in our previous releases. It is a coefficient which explains how well your variables determine changes. In our case, how well the word count, author ID and story count determine the review count. This number ranges from 0 to 1, and anything above 0.8 is awesome. Anything below 0.3 is horrible. In our case, Multiple R is 0.76, which we disregard, and look at the second row, R Square. It's 0.58. Roughly speaking, this means that 58% of the variation in review counts can be explained by how many words you used, how many stories you wrote and when you joined the site. The other 42% comes from factors we have missed.

Now, when you have a lower R Square, below 0.5, it can mean two things: you've missed some important factor while brainstorming or there is a problem with the numbers you've attained. There are methods on refining your data, but our example looks good, so we won't need them.
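If you want to see what sits behind R Square, the one-variable case fits in a few lines. This is a rough Python sketch, not Excel's actual code: fit_line is a hypothetical helper, the data points are invented, and with several X columns you would need a proper statistics package instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R Square."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R Square = 1 - (unexplained variation / total variation)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Invented example: word counts and review counts of five stories.
words = [1000, 3000, 5000, 8000, 12000]
reviews = [3, 7, 8, 15, 17]
slope, intercept, r_square = fit_line(words, reviews)
print(f"y = {slope:.6f}x + {intercept:.2f}, R Square = {r_square:.2f}")
```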

Have a look at the second table creepily labelled ANOVA. On its right edge, you see Significance F. Let's call it "the fail factor". It's 4.06 divided by a number with 18 zeros or "4,06E-18" (0.000...0406). It's a very small number, which means our fail factor won't bother the results. When you see this number grow big, reaching 0.1 and the like, it means your research is destined to fail and you might as well give up because making it work would be as difficult as heart surgery. The fail factor applies not to one variable, but to everything at once, and any connections you make are a coincidence, a fake. But let's put a smile back on your face because our model is safe.

A bit robust, though. We're going to have to butcher it a bit. Third table. There are three methods we can use. All of them should (almost always) give you the same results. Before we do anything, though, look at the row that says "Intercept", the first row of numbers in the third table. Highlight it in yellow, make the text white and do whatever you need to ignore what's written there. Once that is done, here's what we're going to do: see if there are irrelevant variables in the model. Sometimes, a variable is not important enough, does not cause enough changes to your review count, so we may safely kick it out. We determine if any variables are useless, and carefully puncture them out.

Three methods for removing weak variables:

t Stat column. Rule of thumb: any value between -2 and 2 (higher than -2 but lower than 2 [0.7, for instance]) means you should highlight the number's row red, ready to kick it out.

P-value column. It shows the probability that a particular variable is useless. See your confidence level (we have 95% or 0.95 without the percentage). If the P-value is higher than 1 - confidence level (we have 1-0.95 = 0.05), highlight the row red, ready to kick it out.

Lower 95%-Upper 95%. See if they have different signs (one is positive and one is negative). If they do, highlight red.

The most reliable is the third one because it's easy to see the difference between number signs, but any one of these is enough. If you check the table (word c, author ID and story c rows), all three tests would have given you the same results. word c has a high t Stat value, low P-value (E-17 means divided by a huge number), and Lower 95%, Upper 95% have the same sign.
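The three checks are really one check seen from three angles, which is why they tend to agree. A small Python sketch under loud assumptions: three_tests is a hypothetical helper, the coefficient and standard error are invented (only their ratio, a t Stat of 1.55, comes from our table), the critical value is the rule-of-thumb 2, and the P-value uses a normal approximation rather than the exact t distribution Excel relies on.

```python
from math import erfc, sqrt


def three_tests(coefficient, std_error, critical_t=2.0, alpha=0.05):
    """Apply the t Stat, P-value and Lower/Upper 95% checks to one row.

    The P-value uses a normal approximation, which is only reasonable for
    samples of roughly 100 rows or more.
    """
    t_stat = coefficient / std_error
    p_value = erfc(abs(t_stat) / sqrt(2))  # two-sided, normal approximation
    lower = coefficient - critical_t * std_error
    upper = coefficient + critical_t * std_error
    return {
        "t Stat": t_stat,
        "P-value": p_value,
        "weak by t Stat": abs(t_stat) < critical_t,
        "weak by P-value": p_value > alpha,
        "weak by signs": lower < 0 < upper,
    }


# Hypothetical coefficient and standard error, chosen so that the t Stat
# is 1.55 like the story c row; the real values are not quoted in the post.
print(three_tests(coefficient=0.31, std_error=0.20))
```

Because the lower and upper bounds are just the coefficient minus and plus about two standard errors, they straddle zero exactly when the t Stat sits between -2 and 2.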

Author ID has a low t Stat, only 0.545, lower than 2, a P-value higher than 0.05, and signs are different on the Lower-Upper columns.

Story c has 1.55 t Stat, lower than 2, but higher than what Author ID has. P-value is 0.12, higher than 0.05, but lower than what Author ID has. Lower-Upper columns have different signs.

Looks like Author ID and Story c would be highlighted red for removal, but we don't remove them both. Like when we ditched chapter count, we have to cull them one at a time, the least important first. Chances are both will end up as totally unimportant, but when we remove just one variable, the whole model might change.

As you can see, the Upper-Lower test with different signs works as far as telling you "there is/isn't a problem" (boolean), and you can use either t Stat or P-value for deciding which variable is removed. In our case, let's use t Stat. Author ID has a lower t Stat, so we go to our working table, and remove that column.

We are now down to two independent variables, word c and story c, along with our review count.

Tools-Data Analysis-Regression.

Repeat the process. Review column from top to the last number in Y Range, and word c, story c columns in X Range. Tick Labels, Tick Confidence Level 95, click OK.

Once again, we see three tables. Let's look at R Square. It's still 0.58, which means ditching author ID did not lose us even one percent of usefulness. It won't be missed. We skip the second table and go right to the third. Feeling fast, let's go for the P-value test. Only one P-value is higher than 1-0.95=0.05, story c. 0.14>0.05. The t Stat is also lower than 2, so we highlight the row red, and go to the working table.

Delete the "story c" column (should be the one on the right). Now, we're down to just reviews and "word c", two columns. Tools-Data Analysis-Regression. Repeat the steps, only the X Range will be one column instead of two. Click "OK".

And we have another sheet. Looking at the first table, R Square is 0.57 (was 0.58). Ah, so we did lose something with the story count. It may mean that the number of stories you write has an influence on your review count, but it is so insignificant that including it will only make our calculations complicated for very small perks. In any case, the drop was just 0.01 because the t Stat and other tests called that variable insignificant. Had you accidentally kicked an important variable, R Square would have dwindled...by a third or something.

So, what do we have now? Obviously, t Stat and other tests are okay. We're out of insignificant and useless variables. Oh, and look at the corner of the second table! Our fail factor has become lower. It's 1.02E-19. Used to be 4.06E-18. 40 times smaller. Nearly 98% of our fail factor was contributed by the variables we kicked.

Now, we can draw a rule for the review count in FanFiction.Net's Sonic the Hedgehog section in November, 2010.

y=0.001326x1 + 2.46 + e

y - review count
x1- word count
e - compulsory random error, for all the forces we did not account for

As you can see, the function is linear. By default, you should get 2 reviews in Sonic the Hedgehog (the 2.46 at the end). Every word you write, according to this function, adds a little over a thousandth of a review. This means, if you write a thousand words, you, statistically, get between 3 and 4 reviews (2.46 + 1.33 = 3.79). "Statistically" means "on average". This equation is a pretty good tool to measure how well your story is faring against works of others.
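To play with the rule without a calculator, here is a tiny Python sketch. The two constants are the ones from our equation; expected_reviews is just a name picked for this example, and the random error e is left out, so treat the output as an average, not a promise.

```python
SLOPE = 0.001326   # reviews gained per word, from the equation above
INTERCEPT = 2.46   # reviews expected at zero words


def expected_reviews(word_count):
    """Average review count the linear rule predicts (error term e left out)."""
    return SLOPE * word_count + INTERCEPT


for words in (0, 1_000, 10_000, 50_000):
    print(f"{words:>6} words -> about {expected_reviews(words):.1f} reviews")
```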

Right now, you can make an estimate for your stories written in that fandom. You know what influences the review count, and how many reviews you can expect when you start writing there. If you're a review hog, have a group of friends analyse several fandoms, and join the one that gives you more reviews per written word.

CONCLUSION

Conclusions are necessary in research. They must be brief and informative because some people like spoilers, and skip to the results.

In Sonic the Hedgehog of FanFiction.Net during November, 2010, the total word count influenced the total review count. There was a positive linear relationship, where every extra word added a thousandth of a review.

The number of submitted stories and author ID were irrelevant to the total review count. The chapter count was dropped earlier as a redundant alternative to the word count.

EXTRAS

Here is a bonus for the curious. You did see that our linear function was described as "pretty good". What if there is a better way? Surely, if someone writes 50,000 words in one chapter, they can't possibly get as many reviews as someone with a more reasonable 5,000? Nobody reads 50k in one chapter, you may even think. And your thoughts may be right. Regression analysis gives us linear results, and the line can only go up or down indefinitely from start to finish. We could build a curve.

However, the problem with curves is that the more complicated they are, the more time it takes to put one to use. That in mind, we go to our working table, with just two variables review count and word count. We're going to draw a chart. First, move the reviews column to be on the right of the word count column. We need it to dodge some messy misconceptions on Excel's part.

Insert-Chart. Pick XY (Scatter). It's very important that you use the scattered dot matrix. Upon clicking it, select the default subtype without any connections. Click Next. In the window that appears, you may have what you need already, but, to be sure, look at Data range (below the chart), erase it, and select two columns, the review count and the word count. Make sure the series are in Columns (radio selector). Click either "Next" or "Finish" because we should have everything now.

You should see a weird mess of dots, lots near the zero point, and just a few far from the beginning. Left-click on one of the dots. Several of them should light up yellow. Right-click on the dot, and select "Add Trendline". A new window should appear. You should see different curve types you can select. The linear is the default one, and it would have been identical to the equation above. We select the top right one, Polynomial. Most of the time, it's the most useful curve type. Now, go to Options on top of the window. You should see three tick boxes. Tick the third one, Display R-squared value. Go back to Type, on top of the Add Trendline window. Look at "Order" next to the Polynomial curve.

Order 2 gives you a parabola. Order 3 gets a cubic parabola and so on. The higher the order, the more steeply it will rise. Right now, we have to decide which order is the optimal one. The optimum is somewhat arbitrary. If a higher order does not give you a "sufficient" increase in the R-square value, stick to the current one. If you recall, our linear trend gave us a 0.57 value, so 43% of all changes are a mystery. Let's pick order 2 and click "OK". A curve appears. It reaches the bottom at a certain point, and R Square is 0.62. That's a 5% increase. We've found a better estimate for our function, but is there an even better one?

Repeat the steps: left-click dot, right-click-select Add Trendline, pick Polynomial - Options, tick Display R-squared value on chart - Type, pick order 3, OK. Now, it says 0.701. Eight percent. We've gone up from 0.57 to 0.701 in total only by changing the curve's form. Truth is definitely out there. Usually, it's a sign that going higher is useless, but you can try orders 4 and 5. Make the graph larger, so all the numbers fit on-screen. Order 4 gave 0.707, less than one percent. It's reasonable to assume things only get worse from there. Order 5 is too complicated, and too useless.
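The stopping rule we just followed by feel can be written down. A minimal Python sketch using the four R Square values from this post; pick_order is a hypothetical helper, and the 0.02 cut-off for a "sufficient" gain is an arbitrary assumption of this example, since the optimum is a judgment call.

```python
# R Square values reported in this post for trendline orders 1 to 4.
r_square_by_order = {1: 0.57, 2: 0.62, 3: 0.701, 4: 0.707}


def pick_order(r_squares, min_gain=0.02):
    """Walk up the orders and stop before the first one whose gain in
    R Square over the previous order is below min_gain."""
    orders = sorted(r_squares)
    best = orders[0]
    for previous, current in zip(orders, orders[1:]):
        if r_squares[current] - r_squares[previous] < min_gain:
            break
        best = current
    return best


print(pick_order(r_square_by_order))
```

A stricter cut-off changes the answer, so pick the threshold before looking at the numbers, not after.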

Order 4 is going to be a pretty long equation, and minuscule extra accuracy isn't worth the high-power equations. Order 3 is good, but it will lead to an irrational end (study a 3rd degree parabola). Order 2 isn't bad, but the gains aren't huge either. Let's leave it at order 3. The 13-point gain over the linear fit (0.57 to 0.701) is very nice. Right-click the second lowest (order 3) trend line, click Format Trendline, go to Options and tick Display equation on chart. OK.

It would be: y=-2E-13x^3+4E-8x^2-0.0002x+5.436

This one, while better suited to describe the review count in general, has two problems.

1. It cannot be used for stories longer than 170k.
2. It overappreciates the minimal number of reviews a story can get.

As such, it is good in theory, but, in practice, stories are shorter, and their brevity calls for a different system of reviews. For this reason, let's also include the 2nd degree polynomial function:

y=-5E-9x^2+0.002x-2.7011
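Both trendline equations can be tried out with a short Python sketch. The coefficients are copied from the chart labels, which appear heavily rounded, so the cubic in particular is only a rough guide; cubic, quadratic and peak_words are names made up for this example.

```python
def cubic(x):
    """3rd order trendline from the chart (coefficients as displayed)."""
    return -2e-13 * x**3 + 4e-8 * x**2 - 0.0002 * x + 5.436


def quadratic(x):
    """2nd order trendline from the chart."""
    return -5e-9 * x**2 + 0.002 * x - 2.7011


# A downward parabola a*x^2 + b*x + c peaks at x = -b / (2a).
peak_words = -0.002 / (2 * -5e-9)

for words in (1_000, 10_000, 50_000, 100_000):
    print(f"{words:>7} words: cubic {cubic(words):6.1f}, "
          f"quadratic {quadratic(words):6.1f}")
print(f"quadratic peaks near {peak_words:,.0f} words")
```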

Interestingly, the number of reviews would drop after a story gets more than 200k words. While reasonable, this function has an accuracy problem, compared to the 3rd order. The solutions can be mind-boggling, like taking one function for word counts 0 to 10,000 and another for 10,001+. Less exotically, once we decide to get to the bottom of the issue and stop tolerating discrepancies, we need to not only drop variables, but also drop data points. Without going into two complicated tests, pick Tools-Data Analysis-Descriptive Statistics. Select the two variables, labels in first row. Tick summary statistics and Kth Largest, Kth Smallest, both set to 1. OK. There should be a table with four columns, two per variable. We're going anomaly-hunting.

We need two things, the top value, Mean, and Standard Deviation, row 7. Add three times Standard Deviation to the Mean. For word count, that would be 13,727+3*26,726=93,905. Why are we doing this? Anything above this value is an anomaly, and only 0.3% of all values can be higher than this without messing up our calculations. Since we have 100 data points, any one word count above 93.9k is an anomaly. What do we do to anomalies? We delete them. What do we get afterwards? A headache, looping back to the charts. That's the beauty of statistics: while 80% of all accuracy requires 20% of effort, getting 20% more, you guessed it, makes you sweat a whole 80%.
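The anomaly hunt is easy to script as well. A minimal Python sketch: the first print repeats the arithmetic with our own Mean and Standard Deviation, while the word counts further down are invented for illustration. three_sigma_cutoff and drop_anomalies are hypothetical helper names, and the sketch assumes the sample standard deviation, which is what statistics.stdev computes.

```python
from statistics import mean, stdev


def three_sigma_cutoff(values):
    """Upper anomaly threshold: Mean + 3 * Standard Deviation (sample)."""
    return mean(values) + 3 * stdev(values)


def drop_anomalies(values):
    """Keep only the values at or below the three-sigma cutoff."""
    cutoff = three_sigma_cutoff(values)
    return [v for v in values if v <= cutoff]


# The post's own numbers for word count: Mean 13,727, Standard Deviation 26,726.
print(13_727 + 3 * 26_726)  # 93905

# Invented word counts for illustration: nineteen ordinary stories and one giant.
word_counts = [1_000 * k for k in range(1, 20)] + [400_000]
print(drop_anomalies(word_counts))
```

After dropping anomalies, the mean and standard deviation change, so the fit has to be redone on the cleaned data. That is the loop back to the charts mentioned above.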

Hopefully, this has been an interesting enough adventure in the realm of online research. Calculating the basics really takes but a few minutes, but when the world gets you stumped in conclusions that seem impossible, you may spend hours. And when you think this is ludicrous, ask Facebook or Google if there's a better way to get into your head.

Merry Christmas, folks!

Friday, 1 October 2010

Erased Accounts

We're not idle here, don't worry. Since FanFiction.Net has been glitchy as of late, it was next to impossible to publish a list of purged user accounts. Likewise, it was difficult to create a list of good stories with the most objective criteria available.

As of today, the pending analysis of good fan fictions has a 90% confidence level, which is insufficient for further group analysis. No conclusions are presented from this research to prevent erroneous assumptions.

However, we have a static list of accounts deleted in the years 1998 (since October), 1999, 2000 and the first half of the year 2001. Here is the list. 4700 user accounts were purged in this term. It is an accurate list derived from observation for those dates with 98.7% of all accounts from ID 1 to ID 80,000 checked. It is not suggested that you make site-wide conclusions for the current situation, as the domain's growth and guideline changes ensued in 2002 and 2004, which created conditions that would render the numbers attained for the first 80k inapplicable for later years.

The 0.85% attained via our earlier sample remains the accurate number for deleted accounts. Yes, approximately 1 out of 100 accounts is deleted on FFN. If you have 1000 favourite authors, 8 or 9 of them will cease to exist due to infringement. Should you contest this number, the accounts in our random sample are provided in a separate file (like in the previous post).

Sunday, 18 July 2010

FanFiction.Net Member Statistics

The research team is proud to present the first numbers from our user-related queries. This post answers many questions, including the following:

-How many writers are there on FFN?
-How long will you stay on FFN?
-How many stories do they write?
-How many users are deleted from FFN for infringement of ToS?
-How quickly does FFN grow?
-How many readers should you expect for a story?

First, we must present the methodology, though. The study consisted of generating 1100 random user account IDs spanning from 1 to 2,400,000 (source data at the bottom). It allows us to generate representative unbiased results at a 95.34% confidence level and a 3% error margin. The list has been generated on the 29th of June 2010. Therefore, we have included all accounts that have been registered, enabled and fully functional, without restrictions of story creation or profile/review posting.
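For the sceptics, here is one way to check that a sample of 1100 and a 3% margin go together with a confidence level of about 95.3%. It is a sketch under assumptions the methodology above does not spell out: simple random sampling, the normal approximation, and the worst-case proportion of 0.5. confidence_level is a hypothetical helper name.

```python
from math import erf, sqrt


def confidence_level(sample_size, margin, proportion=0.5):
    """Confidence level implied by a sample size and an error margin.

    Assumes simple random sampling and a normal approximation;
    proportion=0.5 is the worst case (widest margin).
    """
    std_error = sqrt(proportion * (1 - proportion) / sample_size)
    z = margin / std_error
    return erf(z / sqrt(2))  # share of a normal curve within +/- z


print(f"{confidence_level(1100, 0.03):.2%}")
```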

Now, the definitions. You will see the following criteria used in this post:

Empty account: any account that does not host stories uploaded by the owner. In layman terms, there are no stories posted in this account. There may be favourites. Here and here are examples of accounts dubbed 'empty'. Conversely, this is not an empty account.

Active/alive account: any account that has shown signs of life in the past six months, from January 1, 2010. This may be the following: updating or posting a story OR updating the profile OR adding a favourite story OR reviewing a favourite story in the past six months. For example, these two accounts are called 'active' or 'alive' in this post. In the case of the second example, please check the favourites. As long as at least one criterion is met, it is active. Those who joined in the year 2010 are counted as active by default, as a grace period for creating a story.

Inactive/dead account: any account that does not meet the active/alive criterion above. Here are two examples.

Deleted account: any user ID that shows the following or similar message "User does not exist or is no longer an active member."

Main Part

You probably recall that FFN has ~3,300,000 stories from our last research (number rounded up to accommodate growth since the previous post), which is 53% of all posted material, with the other 47% deleted. Keep this in mind for a moment.

In the sample of 1100, we have discovered 742 empty accounts, which means, via representativity, only 32.5% of all FanFiction.Net users have stories posted. How does that transfer into general numbers? In a population of 2,400,000 members 781,000 have stories (4.2 stories per account with a story on or 1.375 stories per every member), while the remaining 1,619,000 do not participate in adding content. Two thirds of all members are pure readers, or so it may seem. If it were correct, we could say that 1 writer has 3 dedicated readers on average, if we assume writers themselves read. However, it's not that simple.

Some accounts are plain dead. How many? In a sample of 1100, 855 accounts were inactive, and showed no signs of life in the year 2010. What does that mean for FFN? 78% of all accounts on FanFiction.Net are dead. Less than a fourth, or 22% is currently at your disposal, or 528,000, which is less than the number of accounts with stories on them.

The fun part begins now. How many writers are active? Who could you expect updates from? We connect the overlapping clauses of 'active' and 'not empty'. In a sample of 1100, 130 accounts showed signs of life and had stories on. It translates into: 12% of all accounts on FFN have at least one published story and are actively engaged in fandom activity. 88% of members on FFN are currently not shaping any fandom. As for those, who do, there are 283,000 of them. We have found out that there are 5259 fandoms on FFN, which would mean 54 people keep a fandom alive in the course of 6 months.
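All of the site-wide figures so far come from the same move: take a count from the sample of 1100 and scale it to 2,400,000 accounts. A minimal Python sketch with the counts quoted in this post; extrapolate is a hypothetical helper. The post rounds the percentage first and multiplies afterwards, so its figures differ slightly from the unrounded ones printed here.

```python
SAMPLE_SIZE = 1100
POPULATION = 2_400_000


def extrapolate(sample_count):
    """Return (share of the sample, estimated number of accounts site-wide)."""
    share = sample_count / SAMPLE_SIZE
    return share, share * POPULATION


# Counts quoted in the post; accounts with stories = 1100 - 742 empty ones.
counts = {
    "accounts with stories": SAMPLE_SIZE - 742,
    "inactive accounts": 855,
    "active accounts with stories": 130,
}

for label, count in counts.items():
    share, estimate = extrapolate(count)
    print(f"{label}: {share:.1%} of the sample, about {estimate:,.0f} accounts")
```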

On average, no more than 54 people appear in a fandom over six months. How many new people is that per day? 0.3 of a person drops into an average fandom. An average fandom has 681 stories. A median fandom, the one in the middle, which ditches the enormous influence of HP with 0.5 million stories, has 16. That was a bit of extra information, and we now return to users.

One aspect of FFN particularly interested the research team: the number of account deletions by the administrator. 0.73% was the number we acquired. That's less than 1 in 100. However, let us convert that into raw numbers: 17,500. We add an arbitrary 3000 to that number because accounts from 1 to 3000 are unavailable, and the account number generator did not account for it. What do we get? Since September 1998, FanFiction.Net has deleted over 20,500 users for infringement. That is 0.85% of all users. 4.75 accounts are deleted per day on average, a very modest number because we disregard deletions that are impossible to document and test easily, like those attributed to policy changes (for instance, when MSTs were deemed unwelcome).

Who would that be? Blacklisted people: spammers, trolls, plagiarists, other infringers. They missed a few trying to use FFN as an advertising venue here and here.

By now, you already know how many accounts there are in total. It's time to break them into a time series and give you an understanding of how quickly FFN grows.

A table below tackles this issue. We need to explain the columns for complete clarity:

Total: the last account ID created in the year (AKA summary number of accounts created until December 31, all years including the one in the row [accounts made this year + all accounts made in the previous years])
Change: number of accounts that were created in the year in question
Growth%: Change as a percentage of the previous year's Total, i.e. how many accounts FFN gained relative to everything created before that year.
CChange: chained value of change. This year's Growth% divided by last year's Growth%, divided again by the ratio of this year's Total to last year's. Answers how much quicker (above 1) or slower (below 1) the site grew this year in comparison with the previous one; acceleration.
Middle: the date when half of the annual growth is reached, 50% of accounts created in that year are already present by this date.

Year - Total - Change - Growth% - CChange - Middle
1999* - 6,749 - ... - ... - ... - ...
2000 - 33,090 - 26,620 - 411.4 - ... - ...
2001 - 147,200 - 114,110 - 344.8 - 0.19 - ...
2002 - 318,900 - 171,700 - 116.6 - 0.16 - ...
2003 - 512,000 - 193,100 - 60.6 - 0.32 - June 22
2004 - 733,000 - 221,000 - 43.2 - 0.5 - June 13
2005 - 959,000 - 226,000 - 30.8 - 0.55 - June 29
2006 - 1,188,200 - 229,200 - 23.9 - 0.63 - June 21
2007 - 1,458,900 - 270,700 - 22.8 - 0.78 - June 17
2008 - 1,788,000 - 329,100 - 22.6 - 0.81 - June 3
2009 - 2,238,000 - 450,000 - 25.2 - 0.89 - May 31
2010** - 2,680,000 - 442,000 - 19.8 - 0.66 - July 21

*Accounts created in 1998 added. It is impossible to tell when exactly a person joined before 2000-01-07.
**estimated, based on the first 6 months.
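The derived columns can be rebuilt from the Total column alone. Below is a Python sketch that reads Growth% as Change divided by the previous year's Total, and CChange as the year-over-year ratio of Growth% divided by the year-over-year ratio of Total. That reading is a reconstruction from the numbers rather than a quoted formula, although it reproduces the CChange column from 2002 onwards. It starts at 2000 because the earlier rows are incomplete, and growth_table is a hypothetical helper name.

```python
# Total column from the table (last account ID at the end of each year).
totals = {
    2000: 33_090, 2001: 147_200, 2002: 318_900, 2003: 512_000,
    2004: 733_000, 2005: 959_000, 2006: 1_188_200, 2007: 1_458_900,
    2008: 1_788_000, 2009: 2_238_000, 2010: 2_680_000,
}


def growth_table(totals):
    """Derive Change, Growth% and CChange for every year with enough history."""
    rows = {}
    years = sorted(totals)
    for previous, year in zip(years, years[1:]):
        change = totals[year] - totals[previous]
        growth = 100 * change / totals[previous]
        row = {"Change": change, "Growth%": growth}
        if previous in rows:  # CChange needs last year's Growth% as well
            growth_ratio = growth / rows[previous]["Growth%"]
            total_ratio = totals[year] / totals[previous]
            row["CChange"] = growth_ratio / total_ratio
        rows[year] = row
    return rows


for year, row in growth_table(totals).items():
    cchange = f"{row['CChange']:.2f}" if "CChange" in row else "..."
    print(f"{year} - {row['Change']:,} - {row['Growth%']:.1f} - {cchange}")
```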

Before we begin analysing the data, there is an explanation for our 2010 estimate. We calculated it according to seasons, not a plain average. Based on our calculations, by June 21 the site receives 50% of its annual account growth spurt. This means that slightly more accounts are created in the first half of the year, than in the next six months. Site-wide, there is no reason to assume 'big' events like the release of a movie or a new popular book create significant fluctuations. Years before 2002 were not included due to volatility while the site was still young.

Now, let's carry on with the examination. As you can see in the Total column, the site is growing every year. Rational. The Change column shows that an increasing number of people joined the site each year through 2009, with the period from 2004 till 2006 being stable in terms of Change. Things become trickier with Growth% and CChange. Some of you may be confused why a site which is growing more and more in raw numbers seems to score poorly in the last two columns. The explanation is as follows: as the site grows, it needs a larger number of new accounts to sustain the same rate. Simple example: a site with 1000 accounts made in the previous year gets 1000 more this year, growing by 100%. Now it has 2000 accounts. If the site gains another 1000 next year, that 1000 will be relatively smaller (50% vs 100%) than the first. The same is happening to FFN, as it gains a similar number of accounts that weigh less and less.

The rate of acceleration or slowing down is most visible in CChange. Not a single value is higher than 1, which means the site never grew faster than the year before. The closer the value is to zero, the less momentum the site gains compared to last year. From 2000 till 2009, CChange was creeping closer to 1, a sustainable equilibrium point, but the year 2010 returns us to the levels of 2006.

In layman terms, imagine two speeding cars. One of them is the site, and the other is 1, how the site did last year. The other car is a ghost/time challenge type that repeats the race as it was before. The ghost reaches the finish line first every time because your car never reaches the value of 1. You lose one race. Next time, the ghost repeats how you raced the time you lost. And again. Meaning, every race the ghost is slower, repeating your losses. You keep losing, though. While you do, you notice that if at first you lost by a long shot, after several runs, you still lose, but 1 is a lot closer.

If it weren't for 2010, a great gap in a seemingly fluent continuity, we could have made an obvious conclusion that FFN will, eventually, grow faster, and its growth will be bigger both in volume and ratio that volume takes in the whole (your car will start a winning streak).

Regression analysis showed that there is a polynomial relationship between time and growth. Linearly, there is a positive relationship, and a linear trendline (R^2=0.825) would claim that the site will reach CChange=1 in 2012.

A polynomial trend fits better, with R^2=0.9 for the parabola. It means that the function you will see below 'catches' 90% of all vibrations that our growth spurt (CChange) makes, and best describes fluctuations in growth on FFN. What does that R^2 mean? 90% of all growth fluctuations are explained by time in the function below.

y = -0.0094x^2 + 0.218x - 0.4813

y - CChange value

x - number of years since 1998 (0, 1, 2, et cetera)

Basically, this function allows us to calculate the future of FFN. What is it? Well, according to this, the CChange value will be 0 when the site reaches 21 years of age or by year 2019. This is the scenario we follow if the site does not gain momentum by 2012. If we employed descriptive statistics, any CChange above 0.779 and under 0.3 would have been considered anomalous (the rule of three standard errors). Removing those values gives us a more pessimistic, yet less accurate, picture of these events. Reaching 1 would take three years longer linearly, and negative CChange would also be acquired sooner in more reliable polynomial models. Our choice on extrapolation is based on the principle of numeric accuracy, provided other factors remain static. Surely, clever website management and an increased interest in fan fiction as a concept is bound to change the end result. It does, however, suggest that site administration would avoid the trend described in this exercise.
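To check the 21-year figure, the parabola can be solved directly. A minimal Python sketch using the three coefficients of the function above; the variable names are made up for this example, and the result is only as good as the extrapolation itself.

```python
from math import sqrt

A, B, C = -0.0094, 0.218, -0.4813  # coefficients of the parabola above
START_YEAR = 1998                   # x = 0


def cchange(x):
    """Predicted CChange x years after 1998."""
    return A * x**2 + B * x + C


# Quadratic formula: x = (-B +/- sqrt(B^2 - 4AC)) / (2A)
root = sqrt(B**2 - 4 * A * C)
early_zero = (-B + root) / (2 * A)
late_zero = (-B - root) / (2 * A)
peak = -B / (2 * A)

print(f"zero crossings at x = {early_zero:.1f} and x = {late_zero:.1f}")
print(f"late crossing falls in year {START_YEAR + round(late_zero)}")
print(f"peak at x = {peak:.1f}, CChange = {cchange(peak):.2f}")
```

The later crossing is the one that matters for the forecast; the earlier one falls in the site's first years, before the data used for the fit settles down.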

As a final part of this piece of research, we would like to address a number we have shown you before: 12%, the number of accounts that have stories on and currently participate in fandom. Another 10% are active readers and do not have any stories posted. This is a general number, though, and we are sure you are more curious to know where you stand among your peers rather than in the whole site.

Below is a table with the following columns:
Year: year of joining.
Full: probability (%) that your account is still active and has stories if you joined in the designated year
Empty: probability (%) that your account is still active, but has no stories, if you joined in the designated year
Full stays: probability (%) that, if you have stayed until July 2010, you have stories on

We start from the year 2002, when initial FFN volatility abated. Empty in 2010 is skipped.

Year - Full - Empty - Full stays
2002 - 6.4 - 2.5 - 71.4
2003 - 8.5 - 1.1 - 88.9
2004 - 3.7 - 1.9 - 66.7
2005 - 5.7 - 2.3 - 71.4
2006 - 9.1 - 2.0 - 81.8
2007 - 9.1 - 5.8 - 61.1
2008 - 16.2 - 2.8 - 85.2
2009 - 18.6 - 21.3 - 46.6
2010 - 28.4

Interestingly, you are more likely to stay over a year on FFN if you have stories and are a writer than if you were just a reader. However, you have an equal chance of staying on FFN for a year, writer or reader alike. Regardless, if you join FFN, chances are you will not write a story and you will not be on the site longer than six months.

Even if you have written a story, it is most probable that you will not be on the site longer than six months. This is a generous time period, and it could be that six months is the most probable activity lifespan because it is the starting point and anything smaller does not exist in this part.

We have worked on regression to give you an easy way to calculate your prospects of staying on FFN. A fifth degree polynomial function had the biggest R^2=0.99. Amusingly, the probability would go down to negative 1700% very quickly after 8 years, so we had to switch to a simpler parabolic function with R^2=0.96.

y = 0.0218x^2 - 0.2603x + 0.961

x - the number of years you have/are intending to stay on FFN. (Works for values up to 10 years).

y - the chance that you will stay, as a fraction (1 = 100%).

According to the given function, it is least likely that you will stay on FFN for 6 years. Thus, yes, more likely that it will be 7 or 8. We attribute this to some form of fandom patriotism the earliest members have expressed to the site. A more precise function would have to include account deletions, which should, in reality, lower active account rates (remember the 3000 first accounts?) and the possibility of staying much longer than 8 years. In any case, the function above is presented for your amusement. A more informative variant is below.

We understand that it might be difficult to imagine the contextual difference between 6% and 9% dominant in the previous table. For this reason, we have made a coefficient, so 28.4%=1. This way, you will see more clearly how many active accounts die away, and how many stay active.

8 years 23%
7 years 30%
6 years 13%
5 years 20%
4 years 32%
3 years 32%
2 years 57%
1 year 65%
0 years 100%

The process can be done further if you want to see how many % of 65% et cetera die in the following years.
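The coefficient column is a plain rescaling of the Full column from the earlier table, with the 2010 value of 28.4% set to 100%. A minimal Python sketch of that step; survival_shares is a hypothetical helper name, and the input numbers are copied from the table.

```python
# "Full" column of the previous table: year of joining -> % still active with stories.
full = {
    2002: 6.4, 2003: 8.5, 2004: 3.7, 2005: 5.7, 2006: 9.1,
    2007: 9.1, 2008: 16.2, 2009: 18.6, 2010: 28.4,
}
SURVEY_YEAR = 2010


def survival_shares(full, base_year=SURVEY_YEAR):
    """Rescale so the base year's value equals 100%, keyed by years spent on site."""
    base = full[base_year]
    return {base_year - year: round(100 * value / base)
            for year, value in full.items()}


for years_on_site, share in sorted(survival_shares(full).items(), reverse=True):
    print(f"years on site: {years_on_site} -> {share}%")
```

To go one step further, divide any row by the row one year younger: that gives the share of accounts that survived that particular year.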

Active fanfic participating accounts (those that make up 12% on the site, remember that) lose 35% of their numbers in the first year. The second most rapid drop is at 3 years, but people who tend to stay 3 years are prone to staying 4. The last accurate piece of data that agrees with the trend (the more time passes, the fewer people stay) is 6 years. Only 1/8 of the people who are active writers right after joining remain this way. 7/8 chip off during the trip. As such, the number of permanent contributors (who stay on the site for years) increases as FFN grows. There is only one 'but': the increase is mostly consumed by users abandoning their accounts.

Those who have spent less than 6 months account for 6.5% (29.5%) of the 22% of people that are active in any way. Another 7.3% (33.1%) come from those who have spent more than a year. As such, it is reasonable to say that almost two thirds of the site is actively inhabited by inexperienced account owners, rated 'fans' in forums. So-called 'fanatics' make up a third of the active population, a third that spans from 1998 to the beginning of 2009. On the one hand, it is peculiar that the number of active newbies (writers or just readers) is almost equal to that of 'fanatics'. On the other, it should make quality control out of the question. Why does it not even out? A question we leave in your hands, dear readers.

Conclusion

Unless FFN manages to speed up its growth potential, those 12% that currently shape the fandom will not be enough, especially because ~5 accounts are deleted every day. The site needs to replace more than 35% of active users every year, and 2010 so far looks the most challenging yet. More dedication, fellow fans. May the concept of fan fiction prosper.

Added: here is a list of user accounts in our sample.

Question: What about people who just go to forums, aren't they active?
Answer: They do not make use of the site's core service as a fan fiction archive. If you don't write or read stories, you are considered inactive. The only way a forum goer could be included as active (provided they have no stories or favourites) is if they updated their profile this year.