ALANN – the automated literary analysis neural network idea posed here some months ago (https://anonymole.wordpress.com/2016/09/25/so-you-wrote-a-novel/) “might” be realizable to some degree without the need for a DeepMind neural network.
There are no doubt certain aspects of writing (that this author is being slowly made aware of) that can be extracted as metrics from any writing.
Here are a few.
- Word counts and the ratio of those counts
- Verb count
- Adverb count
- Adjective count
- Proper name counts
- Single word counts
- Comma count
- Character length of words
- Sentence length and sentence complexity
- Quote counts and their dispersion throughout the text
- Certain word usages, active vs passive voice
- Jaggedness, how choppy is the the dialog vs narrative
Regarding word counts, what are the ratio’s of some word counts to others? What about common literary words vs the total count? Filter words, decorative, embellishment words vs the total?
Here’s some data (the means to acquire this data is below).
Let’s consider comma count to sentence count (Comma/Sent) as a measurement of “literary” intent. The higher the number the more lofty the writing (or the more Victorian…)
Charles Dickens’ Great Expectations had a Comma/Sent ratio of 200%. There were twice as many commas as periods.
Jack London’s White Fang, on the other hand, had a ratio of only 101%, there were about as many commas as periods.
If we examine the other writers and their works, this simple metrics *seems* to correlate with our expectations. HG Wells, Burroughs have lower “literary” quotients than Jane Austen or Herman Melville.
So, are there other factors that we can use to investigate the literary vs genre vs popular vs what-have-you aspects of novels? And, primarily, can we build a system that can judge them?
ALANN | ||||||
Title | Author | Comma/Sent | Excl/Sent | Semi/Sent | Dial/Sent | Sing/Word |
Adventures of Huckleberry Finn | Mark Twain | 164.99% | 10.41% | 31.99% | 32.51% | 2.50% |
Great Expectations | Charles Dickens | 200.07% | 11.56% | 14.75% | 46.07% | 2.02% |
Blue Across the Sea | AnonyMole | 93.06% | 2.44% | 0.76% | 32.48% | 4.02% |
Pride and Prejudice | Jane Austen | 147.77% | 8.07% | 24.89% | 28.58% | 2.05% |
Moby Dick | Herman Melville | 256.49% | 23.57% | 56.09% | 19.49% | na |
Tarzan of the apes | Edgar R Burroughs | 129.84% | 3.99% | 6.82% | 22.90% | 3.86% |
Sense and Sensibilities | Jane Austen | 200.85% | 11.32% | 32.03% | 31.40% | 2.07% |
Island of Dr. Moreau | HG Wells | 115.16% | 6.78% | 12.83% | 24.77% | 6.15% |
White Fang | Jack London | 101.10% | 1.98% | 4.81% | 10.27% | 4.15% |
Here’s a site I found to help kickstart this concept:
https://www.online-utility.org/text/analyzer.jsp
If we go to the Gutenberg Project and pick some books, let’s start with Adventure of Huckleberry Finn: https://www.gutenberg.org/ebooks/76
What is the comparison of the word “was” to the total word count?
Order | Unfiltered word count | Occurrences | Percentage |
1. | and | 6350 | 5.6714 |
2. | the | 4779 | 4.2683 |
3. | i | 3270 | 2.9206 |
4. | a | 3150 | 2.8134 |
5. | to | 2934 | 2.6205 |
6. | it | 2326 | 2.0774 |
7. | was | 2069 | 1.8479 |
8. | he | 1676 | 1.4969 |
9. | of | 1633 | 1.4585 |
10. | in | 1433 | 1.2799 |
11. | you | 1360 | 1.2147 |
12. | that | 1083 | 0.9673 |
13. | but | 1035 | 0.9244 |
14. | so | 961 | 0.8583 |
15. | on | 880 | 0.7860 |
16. | up | 861 | 0.7690 |
17. | all | 852 | 0.7610 |
18. | we | 848 | 0.7574 |
19. | for | 843 | 0.7529 |
20. | me | 823 | 0.7351 |
Now, what of the total single use words?
There were 2752 words used exactly once of a total of 110016 words which gives us a percentage of 2.5014%
Using this tool: https://jumk.de/wortanalyse/word-analysis.php
General:
110016 words
554985 characters (with space)
450511 characters (without space)
104474 spaces
421151 letters
14 numbers
29346 others
2509 blank lines
11430 line breaks
Punctuation marks:
4870 times . (dot)
8035 times , (comma) commas to periods ratio percentage: 164.98%
729 times ? (question mark)
507 times ! (exclamation mark) written energy bangs per period: 10.41%
426 times : (colon)
1558 times ; (semicolon) semicolons as a sign of sentence count: 31.99%
2973 times – (hyphen)
0 times / (slash)
3166 times ” (quote) dialog statements as a % of sentence count: 32.50%
5004 times ‘ (single quote)
Now, let’s test another literary work… Great Expectations
Order | Unfiltered word count | Occurrences | Percentage |
1. | the | 8145 | 4.3638 |
2. | and | 7092 | 3.7996 |
3. | i | 6475 | 3.4690 |
4. | to | 5152 | 2.7602 |
5. | of | 4437 | 2.3772 |
6. | a | 4047 | 2.1682 |
7. | in | 3026 | 1.6212 |
8. | that | 2986 | 1.5998 |
9. | was | 2836 | 1.5194 |
10. | it | 2670 | 1.4305 |
11. | he | 2206 | 1.1819 |
12. | you | 2186 | 1.1712 |
13. | had | 2093 | 1.1213 |
14. | my | 2069 | 1.1085 |
15. | me | 1998 | 1.0704 |
16. | his | 1860 | 0.9965 |
17. | as | 1774 | 0.9504 |
18. | with | 1760 | 0.9429 |
19. | at | 1637 | 0.8770 |
20. | on | 1420 | 0.7608 |
Unique word use count 3723 of 184378 words = 2.0192
General:
184378 words
973823 characters (with space)
805308 characters (without space)
168515 spaces
761634 letters
5 numbers
43669 others
4125 blank lines
20011 line breaks
Punctuation marks:
8522 times . (dot)
17050 times , (comma) commas to periods ratio percentage: 200.07%
1216 times ? (question mark)
985 times ! (exclamation mark) written energy bangs per period: 11.55%
105 times : (colon)
1257 times ; (semicolon) semicolons as a % of sentence count: 14.75%
3483 times – (hyphen)
0 times / (slash)
7852 times ” (quote) dialog statements as a % of sentence count: 46.06%
2512 times ‘ (single quote)
And here’s the novel I wrote Blue Across the Sea
Order | Unfiltered word count | Occurrences | Percentage |
1. | the | 6772 | 7.8213 |
2. | and | 2641 | 3.0502 |
3. | to | 2353 | 2.7176 |
4. | a | 1808 | 2.0881 |
5. | of | 1807 | 2.0870 |
6. | he | 1028 | 1.1873 |
7. | you | 1001 | 1.1561 |
8. | tillion | 974 | 1.1249 |
9. | in | 962 | 1.1111 |
10. | i | 839 | 0.9690 |
11. | it | 839 | 0.9690 |
12. | his | 796 | 0.9193 |
13. | that | 695 | 0.8027 |
14. | with | 650 | 0.7507 |
15. | they | 632 | 0.7299 |
16. | as | 623 | 0.7195 |
17. | from | 613 | 0.7080 |
18. | her | 588 | 0.6791 |
19. | we | 534 | 0.6167 |
20. | up | 514 | 0.5936 |
(“was” came in down around 280 instances…)
Unique word count 3469 out of 86219 = 4.0234
General:
86219 words
475623 characters (with space)
391243 characters (without space)
84380 spaces
370677 letters
4 numbers
20562 others
156 blank lines
2460 line breaks
Punctuation marks:
6595 times . (dot)
6137 times , (comma) commas to periods ratio percentage: 93.05%
644 times ? (question mark)
161 times ! (exclamation mark) written energy bangs per period: 2.44%
11 times : (colon)
50 times ; (semicolon) semicolons as a % of sentence count: 0.7581%
348 times – (hyphen)
1 times / (slash)
4284 times ” (quote) dialog statements as a % of sentence count: 32.47%
1912 times ‘ (single quote)
Here we will be adding additional literary works:
https://docs.google.com/spreadsheets/d/1Xop9GaBhjvXgA7dnupLlTin4VUNwNwPPrJWfGEMwxww/edit?usp=sharing
…
https://machinelearningknowledge.ai/openai-gpt-3-demos-to-convince-you-that-ai-threat-is-real-or-is-it/
LikeLike
https://www.technologyreview.com/2020/08/14/1006780/ai-gpt-3-fake-blog-reached-top-of-hacker-news/
~~~
A college kid’s fake, AI-generated blog fooled tens of thousands. This is how he made it.
“It was super easy actually,” he says, “which was the scary part.”
by Karen Hao archive page
August 14, 2020
At the start of the week, Liam Porr had only heard of GPT-3. By the end, the college student had used the AI model to produce an entirely fake blog under a fake name.
It was meant as a fun experiment. But then one of his posts reached the number-one spot on Hacker News. Few people noticed that his blog was completely AI-generated. Some even hit “Subscribe.”
While many have speculated about how GPT-3, the most powerful language-generating AI tool to date, could affect content production, this is one of the only known cases to illustrate the potential. What stood out most about the experience, says Porr, who studies computer science at the University of California, Berkeley: “It was super easy, actually, which was the scary part.”
GPT-3 is OpenAI’s latest and largest language AI model, which the San Francisco–based research lab began drip-feeding out in mid-July. In February of last year, OpenAI made headlines with GPT-2, an earlier version of the algorithm, which it announced it would withhold for fear it would be abused. The decision immediately sparked a backlash, as researchers accused the lab of pulling a stunt. By November, the lab had reversed position and released the model, saying it had detected “no strong evidence of misuse so far.”
The lab took a different approach with GPT-3; it neither withheld it nor granted public access. Instead, it gave the algorithm to select researchers who applied for a private beta, with the goal of gathering their feedback and commercializing the technology by the end of the year.
Porr submitted an application. He filled out a form with a simple questionnaire about his intended use. But he also didn’t wait around. After reaching out to several members of the Berkeley AI community, he quickly found a PhD student who already had access. Once the graduate student agreed to collaborate, Porr wrote a small script for him to run. It gave GPT-3 the headline and introduction for a blog post and had it spit out several completed versions. Porr’s first post (the one that charted on Hacker News), and every post after, was copy-and-pasted from one of the outputs with little to no editing.
“From the time that I thought of the idea and got in contact with the PhD student to me actually creating the blog and the first blog going viral—it took maybe a couple of hours,” he says.
A screenshot of one of Liam Porr’s fake blog posts at #1 on Hacker News.
Porr’s fake blog post, written under the fake name “adolos,” reaches #1 on Hacker News. Porr says he used three separate accounts to submit and upvote his posts on Hacker News in an attempt to push them higher. The admin said this strategy doesn’t work, but his click-baity headlines did.
The trick to generating content without the need for much editing was understanding GPT-3’s strengths and weaknesses. “It’s quite good at making pretty language, and it’s not very good at being logical and rational,” says Porr. So he picked a popular blog category that doesn’t require rigorous logic: productivity and self-help.
From there, he wrote his headlines following a simple formula: he’d scroll around on Medium and Hacker News to see what was performing in those categories and put together something relatively similar. “Feeling unproductive? Maybe you should stop overthinking,” he wrote for one. “Boldness and creativity trumps intelligence,” he wrote for another. On a few occasions, the headlines didn’t work out. But as long as he stayed on the right topics, the process was easy.
After two weeks of nearly daily posts, he retired the project with one final, cryptic, self-written message. Titled “What I would do with GPT-3 if I had no ethics,” it described his process as a hypothetical. The same day, he also posted a more straightforward confession on his real blog.
A screenshot of someone on Hacker News accusing the Porr’s blog post of being written by GPT-3. Another user responds that the comment “isn’t acceptable.”
The few people who grew suspicious of Porr’s fake blog were downvoted by other members in the community.
Porr says he wanted to prove that GPT-3 could be passed off as a human writer. Indeed, despite the algorithm’s somewhat weird writing pattern and occasional errors, only three or four of the dozens of people who commented on his top post on Hacker News raised suspicions that it might have been generated by an algorithm. All those comments were immediately downvoted by other community members.
For experts, this has long been the worry raised by such language-generating algorithms. Ever since OpenAI first announced GPT-2, people have speculated that it was vulnerable to abuse. In its own blog post, the lab focused on the AI tool’s potential to be weaponized as a mass producer of misinformation. Others have wondered whether it could be used to churn out spam posts full of relevant keywords to game Google.
Porr says his experiment also shows a more mundane but still troubling alternative: people could use the tool to generate a lot of clickbait content. “It’s possible that there’s gonna just be a flood of mediocre blog content because now the barrier to entry is so easy,” he says. “I think the value of online content is going to be reduced a lot.”
Porr plans to do more experiments with GPT-3. But he’s still waiting to get access from OpenAI. “It’s possible that they’re upset that I did this,” he says. “I mean, it’s a little silly.”
LikeLike
Microsoft, like some other tech companies, pays news organisations to use their content on its website.
But it employs journalists to decide which stories to display and how they are presented.
Around 50 contract news producers will lose their jobs at the end of June, the Seattle Times reports, but a team of full-time journalists will remain.
“It’s demoralising to think machines can replace us but there you go,” one of those facing redundancy told the paper.
Some sacked journalists warned that artificial intelligence may not be fully familiar with strict editorial guidelines, and could end up letting through inappropriate stories.
Twenty-seven of those losing their jobs are employed by the UK’s PA Media, the Guardian reports.
One journalist quoted in the paper said: “I spend all my time reading about how automation and AI is going to take all our jobs – now it’s taken mine.”
Microsoft is one of many tech companies experimenting with forms of so-called robot journalism to cut costs. Google is also investing in projects to understand how it might work.
LikeLike
LikeLike
https://aws.amazon.com/comprehend/
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. No machine learning experience required.
LikeLike
http://textalyser.net/index.php
LikeLike
https://aeon.co/essays/how-ai-is-revolutionising-the-role-of-the-literary-critic
Regarding the article, however, the author misses a huge aspect of literary analysis that continues to be ignored in this day and age of deep-wide neural network analysis: manuscript evaluation.
Both this:
https://anonymole.wordpress.com/2016/09/25/so-you-wrote-a-novel/
and this:
https://anonymole.wordpress.com/2016/12/04/alann-auto-lit-analysis-neural-net/
get into the concept that *tens of thousands* of new novels are written every year and must be evaluated by literary agents and publishers. There is a huge opportunity lurking here. The team that solves this issue is the team the can claim the right of best new author, hottest new best-sellers, NYTimes top-of-the-list novels for years to come.
Currently, the process of manuscript submission for evaluation is as archaic as they come. It’s pure alchemy performed by cloaked agents in tall towers protected by obscure ramparts and digital moats. This process must change. And the researchers in this article are some of those teams who could change it.
Who cares if computers will ever write Dickens or Austen? We have thousands of authors / today / who need the services of deep AI for evaluating their work. Myself included.
LikeLike