ALANN – Auto Lit Analysis Neural Net

ALANN – the automated literary analysis neural network idea posed here some months ago (https://anonymole.wordpress.com/2016/09/25/so-you-wrote-a-novel/) “might” be realizable, to some degree, without the need for a DeepMind-class neural network.

There are no doubt certain aspects of writing (aspects this author is slowly becoming aware of) that can be extracted as metrics from any text.

Here are a few.

  • Word counts and the ratio of those counts
    • Verb count
    • Adverb count
    • Adjective count
    • Proper name counts
    • Single word counts
  • Comma count
  • Character length of words
  • Sentence length and sentence complexity
  • Quote counts and their dispersion throughout the text
  • Certain word usages, active vs passive voice
  • Jaggedness: how choppy is the dialog vs the narrative

Regarding word counts, what are the ratios of some word counts to others? What about common literary words vs the total count? Filter words and decorative, embellishment words vs the total?


Here’s some data (the means to acquire this data is below).

Let’s consider the ratio of comma count to sentence count (Comma/Sent) as a measurement of “literary” intent. The higher the number, the loftier the writing (or the more Victorian…)

Charles Dickens’ Great Expectations had a Comma/Sent ratio of 200%. There were twice as many commas as periods.

Jack London’s White Fang, on the other hand, had a ratio of only 101%; there were about as many commas as periods.

If we examine the other writers and their works, this simple metric *seems* to correlate with our expectations. H.G. Wells and Burroughs have lower “literary” quotients than Jane Austen or Herman Melville.

So, are there other factors that we can use to investigate the literary vs genre vs popular vs what-have-you aspects of novels? And, primarily, can we build a system that can judge them?
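One way to find out is to start small: a metric like Comma/Sent needs nothing more than raw character counts. Here's a minimal sketch in Python (the function name is invented here, and it treats every period as a sentence terminator — the same rough proxy the ratios in this post rely on):

```python
def comma_sent_ratio(text: str) -> float:
    """Commas per period, as a percentage.

    Assumes every period ends a sentence -- a rough proxy,
    but the same one the ratios in this post rely on.
    """
    periods = text.count(".")
    return 100.0 * text.count(",") / periods if periods else 0.0

sample = ("It was the best of times, it was the worst of times. "
          "We had everything before us, we had nothing before us.")
print(f"{comma_sent_ratio(sample):.0f}%")  # 100% -- one comma per period
```

A corpus-scale run would first need to strip the Gutenberg header/footer boilerplate, which would otherwise skew the counts.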

ALANN
Title Author Comma/Sent Excl/Sent Semi/Sent Dial/Sent Sing/Word
Adventures of Huckleberry Finn Mark Twain 164.99% 10.41% 31.99% 32.51% 2.50%
Great Expectations Charles Dickens 200.07% 11.56% 14.75% 46.07% 2.02%
Blue Across the Sea AnonyMole 93.06% 2.44% 0.76% 32.48% 4.02%
Pride and Prejudice Jane Austen 147.77% 8.07% 24.89% 28.58% 2.05%
Moby Dick Herman Melville 256.49% 23.57% 56.09% 19.49% na
Tarzan of the Apes Edgar R Burroughs 129.84% 3.99% 6.82% 22.90% 3.86%
Sense and Sensibility Jane Austen 200.85% 11.32% 32.03% 31.40% 2.07%
Island of Dr. Moreau HG Wells 115.16% 6.78% 12.83% 24.77% 6.15%
White Fang Jack London 101.10% 1.98% 4.81% 10.27% 4.15%

 

Here’s a site I found to help kickstart this concept:

https://www.online-utility.org/text/analyzer.jsp

If we go to the Gutenberg Project and pick some books, let’s start with Adventures of Huckleberry Finn: https://www.gutenberg.org/ebooks/76

What is the comparison of the word “was” to the total word count?

Rank Word Occurrences Percentage (unfiltered word counts)
1. and 6350 5.6714
2. the 4779 4.2683
3. i 3270 2.9206
4. a 3150 2.8134
5. to 2934 2.6205
6. it 2326 2.0774
7. was 2069 1.8479
8. he 1676 1.4969
9. of 1633 1.4585
10. in 1433 1.2799
11. you 1360 1.2147
12. that 1083 0.9673
13. but 1035 0.9244
14. so 961 0.8583
15. on 880 0.7860
16. up 861 0.7690
17. all 852 0.7610
18. we 848 0.7574
19. for 843 0.7529
20. me 823 0.7351

Now, what of the total single-use words?

There were 2752 words used exactly once out of a total of 110016 words, which gives a percentage of 2.5014%.
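Both the frequency table and the single-use percentage can be reproduced with a few lines of Python. A rough sketch (the word-splitting regex is an assumption and won't exactly match the online tool's tokenizer, so the numbers will differ slightly):

```python
import re
from collections import Counter

def word_stats(text: str, top_n: int = 5):
    """Top-N word frequencies plus the percentage of words used exactly once."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    singles = sum(1 for n in counts.values() if n == 1)
    top = [(w, n, 100.0 * n / len(words)) for w, n in counts.most_common(top_n)]
    return top, 100.0 * singles / len(words)

sample = "the cat sat on the mat and the dog sat too"
top, single_pct = word_stats(sample, top_n=2)
print(top)         # 'the' leads with 3 of the 11 words
print(single_pct)  # 6 of 11 words appear exactly once
```

Run against a full Gutenberg text, the `single_pct` figure corresponds to the Sing/Word column in the table above.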

Using this tool: https://jumk.de/wortanalyse/word-analysis.php

General:
110016 words
554985 characters (with space)
450511 characters (without space)
104474 spaces
421151 letters
14 numbers
29346 others
2509 blank lines
11430 line breaks

Punctuation marks:
4870 times . (dot)
8035 times , (comma) commas to periods ratio percentage: 164.98%
729 times ? (question mark)
507 times ! (exclamation mark) written energy bangs per period: 10.41%
426 times : (colon)
1558 times ; (semicolon) semicolons as a % of sentence count: 31.99%
2973 times – (hyphen)
0 times / (slash)
3166 times ” (quote) dialog statements as a % of sentence count: 32.50%
5004 times ‘ (single quote)
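All of the derived ratios above fall out of these raw punctuation counts. A sketch, with two assumptions called out in the comments: every period ends a sentence, and each dialog statement is bracketed by a pair of quote marks (hence the halving, which matches the arithmetic used in this post):

```python
def punct_ratios(text: str) -> dict:
    """Derive the punctuation-based ratios used in this post from raw text.

    Assumes every period ends a sentence (a rough proxy) and that
    each dialog statement uses a pair of quote marks.
    """
    periods = text.count(".") or 1  # avoid division by zero on odd inputs
    # Count straight and curly double quotes alike.
    quotes = sum(text.count(q) for q in ('"', '\u201c', '\u201d'))
    return {
        "comma/sent":  100.0 * text.count(",") / periods,
        "excl/sent":   100.0 * text.count("!") / periods,
        "semi/sent":   100.0 * text.count(";") / periods,
        "dialog/sent": 100.0 * (quotes / 2) / periods,
    }

sample = '"Hello," she said. "Run!" he shouted; then silence.'
print(punct_ratios(sample))  # comma, excl, semi at 50%; dialog at 100%
```

Applying this to the counts above (8035 commas / 4870 periods, 3166 quotes / 2 / 4870 periods, and so on) reproduces the 164.99% and 32.50% figures.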



Now, let’s test another literary work… Great Expectations

Rank Word Occurrences Percentage (unfiltered word counts)
1. the 8145 4.3638
2. and 7092 3.7996
3. i 6475 3.4690
4. to 5152 2.7602
5. of 4437 2.3772
6. a 4047 2.1682
7. in 3026 1.6212
8. that 2986 1.5998
9. was 2836 1.5194
10. it 2670 1.4305
11. he 2206 1.1819
12. you 2186 1.1712
13. had 2093 1.1213
14. my 2069 1.1085
15. me 1998 1.0704
16. his 1860 0.9965
17. as 1774 0.9504
18. with 1760 0.9429
19. at 1637 0.8770
20. on 1420 0.7608

Single-use word count: 3723 of 184378 words = 2.0192%

General:
184378 words
973823 characters (with space)
805308 characters (without space)
168515 spaces
761634 letters
5 numbers
43669 others
4125 blank lines
20011 line breaks

Punctuation marks:
8522 times . (dot)
17050 times , (comma) commas to periods ratio percentage:  200.07%
1216 times ? (question mark)
985 times ! (exclamation mark) written energy bangs per period: 11.55%
105 times : (colon)
1257 times ; (semicolon) semicolons as a % of sentence count: 14.75%
3483 times – (hyphen)
0 times / (slash)
7852 times ” (quote)  dialog statements as a % of sentence count: 46.06%
2512 times ‘ (single quote)



And here’s the novel I wrote, Blue Across the Sea:

Rank Word Occurrences Percentage (unfiltered word counts)
1. the 6772 7.8213
2. and 2641 3.0502
3. to 2353 2.7176
4. a 1808 2.0881
5. of 1807 2.0870
6. he 1028 1.1873
7. you 1001 1.1561
8. tillion 974 1.1249
9. in 962 1.1111
10. i 839 0.9690
11. it 839 0.9690
12. his 796 0.9193
13. that 695 0.8027
14. with 650 0.7507
15. they 632 0.7299
16. as 623 0.7195
17. from 613 0.7080
18. her 588 0.6791
19. we 534 0.6167
20. up 514 0.5936

(“was” came in at around 280 instances…)

Single-use word count: 3469 out of 86219 words = 4.0234%

General:
86219 words
475623 characters (with space)
391243 characters (without space)
84380 spaces
370677 letters
4 numbers
20562 others
156 blank lines
2460 line breaks

Punctuation marks:
6595 times . (dot)
6137 times , (comma) commas to periods ratio percentage:  93.05%
644 times ? (question mark)
161 times ! (exclamation mark) written energy bangs per period: 2.44%
11 times : (colon)
50 times ; (semicolon) semicolons as a % of sentence count: 0.7581%
348 times – (hyphen)
1 times / (slash)
4284 times ” (quote) dialog statements as a % of sentence count: 32.47%
1912 times ‘ (single quote)

Here we will be adding additional literary works:

https://docs.google.com/spreadsheets/d/1Xop9GaBhjvXgA7dnupLlTin4VUNwNwPPrJWfGEMwxww/edit?usp=sharing


7 thoughts on “ALANN – Auto Lit Analysis Neural Net”

  1. https://www.technologyreview.com/2020/08/14/1006780/ai-gpt-3-fake-blog-reached-top-of-hacker-news/

    ~~~

    A college kid’s fake, AI-generated blog fooled tens of thousands. This is how he made it.

    “It was super easy actually,” he says, “which was the scary part.”
    by Karen Hao

    August 14, 2020

    At the start of the week, Liam Porr had only heard of GPT-3. By the end, the college student had used the AI model to produce an entirely fake blog under a fake name.

    It was meant as a fun experiment. But then one of his posts reached the number-one spot on Hacker News. Few people noticed that his blog was completely AI-generated. Some even hit “Subscribe.”

    While many have speculated about how GPT-3, the most powerful language-generating AI tool to date, could affect content production, this is one of the only known cases to illustrate the potential. What stood out most about the experience, says Porr, who studies computer science at the University of California, Berkeley: “It was super easy, actually, which was the scary part.”

    GPT-3 is OpenAI’s latest and largest language AI model, which the San Francisco–based research lab began drip-feeding out in mid-July. In February of last year, OpenAI made headlines with GPT-2, an earlier version of the algorithm, which it announced it would withhold for fear it would be abused. The decision immediately sparked a backlash, as researchers accused the lab of pulling a stunt. By November, the lab had reversed position and released the model, saying it had detected “no strong evidence of misuse so far.”

    The lab took a different approach with GPT-3; it neither withheld it nor granted public access. Instead, it gave the algorithm to select researchers who applied for a private beta, with the goal of gathering their feedback and commercializing the technology by the end of the year.

    Porr submitted an application. He filled out a form with a simple questionnaire about his intended use. But he also didn’t wait around. After reaching out to several members of the Berkeley AI community, he quickly found a PhD student who already had access. Once the graduate student agreed to collaborate, Porr wrote a small script for him to run. It gave GPT-3 the headline and introduction for a blog post and had it spit out several completed versions. Porr’s first post (the one that charted on Hacker News), and every post after, was copy-and-pasted from one of the outputs with little to no editing.

    “From the time that I thought of the idea and got in contact with the PhD student to me actually creating the blog and the first blog going viral—it took maybe a couple of hours,” he says.
    [Screenshot] Porr’s fake blog post, written under the fake name “adolos,” reaches #1 on Hacker News. Porr says he used three separate accounts to submit and upvote his posts on Hacker News in an attempt to push them higher. The admin said this strategy doesn’t work, but his click-baity headlines did.

    The trick to generating content without the need for much editing was understanding GPT-3’s strengths and weaknesses. “It’s quite good at making pretty language, and it’s not very good at being logical and rational,” says Porr. So he picked a popular blog category that doesn’t require rigorous logic: productivity and self-help.

    From there, he wrote his headlines following a simple formula: he’d scroll around on Medium and Hacker News to see what was performing in those categories and put together something relatively similar. “Feeling unproductive? Maybe you should stop overthinking,” he wrote for one. “Boldness and creativity trumps intelligence,” he wrote for another. On a few occasions, the headlines didn’t work out. But as long as he stayed on the right topics, the process was easy.

    After two weeks of nearly daily posts, he retired the project with one final, cryptic, self-written message. Titled “What I would do with GPT-3 if I had no ethics,” it described his process as a hypothetical. The same day, he also posted a more straightforward confession on his real blog.
    [Screenshot] The few people who grew suspicious of Porr’s fake blog were downvoted by other members in the community.

    Porr says he wanted to prove that GPT-3 could be passed off as a human writer. Indeed, despite the algorithm’s somewhat weird writing pattern and occasional errors, only three or four of the dozens of people who commented on his top post on Hacker News raised suspicions that it might have been generated by an algorithm. All those comments were immediately downvoted by other community members.

    For experts, this has long been the worry raised by such language-generating algorithms. Ever since OpenAI first announced GPT-2, people have speculated that it was vulnerable to abuse. In its own blog post, the lab focused on the AI tool’s potential to be weaponized as a mass producer of misinformation. Others have wondered whether it could be used to churn out spam posts full of relevant keywords to game Google.

    Porr says his experiment also shows a more mundane but still troubling alternative: people could use the tool to generate a lot of clickbait content. “It’s possible that there’s gonna just be a flood of mediocre blog content because now the barrier to entry is so easy,” he says. “I think the value of online content is going to be reduced a lot.”

    Porr plans to do more experiments with GPT-3. But he’s still waiting to get access from OpenAI. “It’s possible that they’re upset that I did this,” he says. “I mean, it’s a little silly.”


  2. Microsoft, like some other tech companies, pays news organisations to use their content on its website.

    But it employs journalists to decide which stories to display and how they are presented.

    Around 50 contract news producers will lose their jobs at the end of June, the Seattle Times reports, but a team of full-time journalists will remain.

    “It’s demoralising to think machines can replace us but there you go,” one of those facing redundancy told the paper.

    Some sacked journalists warned that artificial intelligence may not be fully familiar with strict editorial guidelines, and could end up letting through inappropriate stories.

    Twenty-seven of those losing their jobs are employed by the UK’s PA Media, the Guardian reports.

    One journalist quoted in the paper said: “I spend all my time reading about how automation and AI is going to take all our jobs – now it’s taken mine.”

    Microsoft is one of many tech companies experimenting with forms of so-called robot journalism to cut costs. Google is also investing in projects to understand how it might work.


  3. Title Dialog/Sent
    Adventures of Huckleberry Finn 32.51%
    Great Expectations 46.07%
    Blue Across the Sea 32.48%
    Pride and Prejudice 24.73%
    Moby Dick 19.49%
    Tarzan of the Apes 22.90%
    Sense and Sensibility 31.40%
    Island of Dr. Moreau 24.77%
    White Fang 10.27%
    Persuasion 21.35%
    The Gribble’s Eye 31.38%
    Tale of Two Cities 42.88%
    PH-SepSceneWriMo 35.79%


  4. https://aeon.co/essays/how-ai-is-revolutionising-the-role-of-the-literary-critic

    Regarding the article, however, the author misses a huge aspect of literary analysis that continues to be ignored in this day and age of deep-wide neural network analysis: manuscript evaluation.

    Both this:

    https://anonymole.wordpress.com/2016/09/25/so-you-wrote-a-novel/
    and this:
    https://anonymole.wordpress.com/2016/12/04/alann-auto-lit-analysis-neural-net/

    get into the concept that *tens of thousands* of new novels are written every year and must be evaluated by literary agents and publishers. There is a huge opportunity lurking here. The team that solves this issue is the team that can claim the rights to the best new authors, the hottest new best-sellers, the NYTimes top-of-the-list novels for years to come.

    Currently, the process of manuscript submission for evaluation is as archaic as they come. It’s pure alchemy performed by cloaked agents in tall towers protected by obscure ramparts and digital moats. This process must change. And the researchers in this article are some of those teams who could change it.

    Who cares if computers will ever write Dickens or Austen? We have thousands of authors *today* who need the services of deep AI for evaluating their work. Myself included.

