Zipf’s Law is one of the earliest mathematical regularities described in the study of language. It is used in linguistics, informetrics, and, more recently, copywriting. In mathematical statistics, the law is treated as a form of the Pareto distribution. It made it possible to analyze texts from the perspective of language structure, because it describes how often individual words occur in content.


What is Zipf’s Law?

In 1949, the American linguist George Zipf described something interesting about words. He observed that people use some words frequently, while others are used very rarely. The pattern is peculiar: the most popular word is used about twice as often as the second most popular word, and about three times as often as the third. This means that a small portion of words is used constantly, while the vast majority is used very infrequently.

This principle is not limited to linguistics: Zipf found that it also describes the distribution of incomes within a country, where the richest person has roughly twice as much money as the next richest, and so on. It also applies to city sizes: the largest city in a country tends to have about twice the population of the second largest, and so forth.

Now, let’s focus on copywriting.

If you arrange all the words in a large text by decreasing frequency of usage, the frequency of the nth word in such a list will be approximately inversely proportional to its rank, which is also called its order number, n. The second most frequently used word occurs approximately half as often as the first, the third occurs approximately one-third as often as the first, and so on. This is how Zipf’s Law works in linguistics and in copywriting. The main goal of applying this law in creating website content is to distribute words in an article in a way that makes it easy to read and sound natural.


We’ve probably all encountered product descriptions or advice in informational blogs on the internet that are overloaded with keywords, devoid of grammatical correctness, and lacking logical coherence. The authors of such articles not only engage in keyword stuffing and worsen page rankings in search results but also ignore Zipf’s Law. Therefore, they artificially distort content, which, due to oversaturation with repetitions and related words, only repels website visitors.

“Zipf’s Law demonstrates that there are predictable proportions in text written in natural language. Deviations from typical proportions are easy to detect. Thus, identifying over-optimized text that is ‘unnatural’ is not a difficult task.”—“How search engines understand human language” by Yauhen Khutarniuk

Zipf’s Law is a mathematical rule that helps us understand the frequency of words in a text. It is used to check if a text follows a natural language pattern or if it’s structured unnaturally. We can use some formulas to evaluate a text using Zipf’s Law:

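F × R = C,
F = C / R,

where F is the frequency of a word (how many times it appears), R is its rank in the frequency list, and C is a constant. For the top-ranked word R = 1, so C equals the number of times that word appears.

Let’s look at how we calculate this using the second formula: F = C / R.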

Imagine we have a blog post about “LSI Copywriting” with three main keywords: “What is LSI copywriting”, “Tools for gathering LSI phrases”, and “Examples of LSI articles”. If the first keyword appears 10 times in the text, we can calculate the frequency of the second keyword as follows:

  • 10 (C) / 2 (R) = 5 (F)

Similarly, we calculate the frequency of the third keyword as follows:

  • 10 (C) / 3 (R) ≈ 3.3 (F)
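The two calculations above follow the same pattern, so they can be sketched in a few lines of Python. This is only an illustration: the function name `expected_frequency` is hypothetical, and C is assumed to be the count of the top-ranked keyword, as in the example.

```python
def expected_frequency(top_count: float, rank: int) -> float:
    """Zipf estimate F = C / R for a word or phrase at the given rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return top_count / rank


# The first keyword appears 10 times (C = 10), as in the example above.
for rank in (1, 2, 3):
    print(rank, round(expected_frequency(10, rank), 1))
```

The same function can be reused for any rank further down the list; the estimates are approximate targets, not exact quotas.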

The visualization of Zipf’s Law can be represented as a curve on a graph. This curve sharply descends, and in the lower part of the graph, it almost becomes a horizontal line. The most frequently used words are found in the descending part of this hyperbolic curve, while the less commonly used words are located at the bottom, closer to the horizontal line.

The relationship between word frequency (vertical axis) and the word’s position in a frequency dictionary, also known as its rank (horizontal axis).

So, the most common word or phrase in a text occurs about twice as often as the second most frequent one, three times as often as the third, and so on, down to the least frequently used.

This law is not limited to linguistics; it applies far more broadly. Take a list of Ukrainian cities ordered from the most populous down: according to the law, the largest city should be about twice as large as the second largest, and so on. In pre-war data, Kyiv ranks first with a population of 2,967,360, and Kharkiv is second with 1,443,207. If we had no data for the second city and wanted to estimate its population ourselves, we could use the same formula:

  • 2,967,360 (C) / 2 (R) = 1,483,680 (F).

The result does not match the actual 2020 figure exactly, but it is a close approximation. In the same way, you could estimate the population of the third most populous city in Ukraine and continue down the list.
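As a rough check of how close the estimate is, the sketch below compares it with the Kharkiv figure quoted above. The helper name `zipf_estimate` is hypothetical; the two population numbers are the ones given in the text.

```python
def zipf_estimate(largest: int, rank: int) -> float:
    """Estimate the size of the item at `rank` from the largest one (F = C / R)."""
    return largest / rank


KYIV = 2_967_360     # rank 1, figure quoted in the text
KHARKIV = 1_443_207  # rank 2, figure quoted in the text

estimate = zipf_estimate(KYIV, 2)
relative_error = abs(estimate - KHARKIV) / KHARKIV

print(f"estimate: {estimate:,.0f}")
print(f"relative error: {relative_error:.1%}")
```

With these figures the estimate overshoots the quoted population by a little under 3%.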

Zipf’s Law, the Voynich Manuscript, and Economics

Zipf’s Law also makes it possible to judge whether an encrypted message carries meaningful content, as a number of empirical studies have confirmed. Statistical analysis of the Voynich Manuscript, which is written in an unknown language, indicated that the text carries information: its structure looks natural, and the repetition of words in it conforms to Zipf’s Law.

As a result of statistical analysis of the Voynich Manuscript, written in an unknown language, it has been shown that the work contains certain information.

Zipf’s law has found applications in various systems within economics and other fields. In linguistics, it is used to study or improve various texts and even programming languages like Java, C, and others. However, it doesn’t apply to languages like Chinese, Japanese, and Korean with limited vocabulary size. Zipf’s law is primarily observed in the Indo-European language family.

“Zipf’s Law is much more than just another peculiar linguistic phenomenon.” From “Zipfs Law & Zipfian Distribution in SEO”, a presentation by Dawn Anderson.


By the way, reach and the number of likes on social media also follow Zipf’s law. If the most popular post on a blogger’s page received 500 likes, it’s likely that the second-place post will have around 250 “hearts”. This pattern is observed in economics, marketing, and various social and commercial processes.

The history of the emergence of Zipf’s law

Jean-Baptiste Estoup, a French stenographer, was the first to describe the regularity of word and phrase placement in his work “The Stenography Range” in 1908.

“Stenography or shorthand is a fast writing system characterized by the recording of short symbols and abbreviations, allowing for the simultaneous transcription of spoken language.” Source: “Stenography,” Wikipedia.

The first practical application of this law was found in 1913 in Felix Auerbach’s work “The Law of Population Concentration”. In this work, the German physicist described the natural rules of city distribution by size.

The distribution of cities by size nicely explains Zipf’s Law.

In 1949, American linguist George Zipf proposed applying the law to statistical research in economics, sociology, and writing. As an example, he described the distribution of people’s incomes, calculated by the formula F × R = C, or F = C / R.

The richest person in the country has about twice as much money as the second richest, the third has about one-third of the wealth of the first, and so on. From 1926 to 1936, this pattern was confirmed in England, France, Denmark, the Netherlands, Finland, Germany, and the USA.

It is worth noting that Zipf studied and popularized the law, later named after him, while working on methods for teaching foreign languages. He concluded that to master a language, one first needs to know its most common words. Only on this foundation can additional vocabulary be learned, which serves more to embellish written and spoken language. A language cannot be learned from alphabetical dictionaries alone; this requires frequency dictionaries, which emerged only in the early 20th century.

The most commonly used words in the English language

Starting in 1932, while teaching at Harvard University, George Zipf took an interest in the frequency of word usage in language. Comparing the distribution of words in Chinese and Latin, he and his students determined that the product of a word’s frequency and its position in the frequency list is an almost constant value. The specific value of this constant depends on the particular language.
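The observation that frequency multiplied by rank stays roughly constant can be explored on any text. Below is a minimal sketch, not a reference implementation: the function name `rank_frequency_table` is hypothetical, and the simple regular expression used for splitting words is an assumption that only suits English-like text.

```python
import re
from collections import Counter


def rank_frequency_table(text: str, top_n: int = 10):
    """Return (rank, word, frequency, rank * frequency) for the most common words."""
    # Assumption: a word is a run of Latin letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts, start=1)]


sample = "the cat and the dog and the bird saw the cat"
for row in rank_frequency_table(sample, top_n=3):
    print(row)
```

A sample this short is far too noisy to show the regularity; the last column only starts to level off on long texts with thousands of words.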

Zipf’s Law is used in informetrics, the discipline that applies mathematical and statistical methods and models to information, including the identification of regularities. French-American mathematician Benoît Mandelbrot believed that increasing the number of words in a message makes it longer but at the same time reduces the probability of errors in its transmission. More detailed information eliminates the need for clarifications and repetitions, ultimately saving time in data exchange. This is how the scientist explained Zipf’s Law in terms of information transmission.


What should a text be like according to Zipf’s Law?

You can check any article for compliance with Zipf’s Law using special online services or by manually counting using Excel tables.

If the results of the verification show that the first-ranked word appears 30 times, the second-ranked word 29 times, and the third-ranked word 20 times, it is worth revising the text. It is likely that there are too many repetitions of certain words in it. According to Zipf, the number of occurrences of the second word should not be more than 15, and the third word should occur up to 10 times.
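The check described above can be sketched as a small script. It assumes the word counts are already sorted in descending order and treats C / R as an upper limit for each rank, which is one reading of the rule in the paragraph above; the function name `zipf_limits` is hypothetical.

```python
def zipf_limits(counts: list[int]) -> list[tuple[int, int, float, bool]]:
    """For counts sorted in descending order, compare each count with C / R.

    Returns (rank, observed, expected, over_limit) for every rank.
    """
    top = counts[0]  # C: the count of the first-ranked word
    report = []
    for rank, observed in enumerate(counts, start=1):
        expected = top / rank
        report.append((rank, observed, expected, observed > expected))
    return report


for rank, observed, expected, over in zipf_limits([30, 29, 20]):
    flag = "too frequent" if over else "ok"
    print(f"rank {rank}: observed {observed}, expected up to {expected:.0f} -> {flag}")
```

For the counts 30, 29, and 20, the second and third words are flagged, which matches the limits of 15 and 10 mentioned above.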

Keep in mind that a high Zipf score is not the sole or definitive criterion of text quality. Excessive changes made to satisfy the law can cost the article its readability. Content should above all be clear and valuable to readers; technical metrics come second.

How to improve a text while adhering to Zipf’s Law?

  • Choose a set of synonyms for each keyword and use them instead of repetitions and cognates.
  • In commercial texts, avoid overusing words like “buy”, “price”, “promotion”, etc.
  • Minimize the use of clichés and overused phrases that can be found on almost every other website.
  • Avoid padding the text to increase its length. Provide readers with more facts, statistics, and unique stories.
  • Enhance the text with images, videos, infographics, graphs, and tables.
  • Thoroughly explore the topic and study more than 10 sources to prepare the article. Avoid rewriting without a deep understanding of the subjects you describe. Focus not only on keywords for search engine optimization but also on LSI keywords that are useful both for page ranking and content quality.
  • Pay attention to the structure, and don’t forget about paragraphs, bulleted or numbered lists, headings, and subheadings.
  • Use ChatGPT as a reference aid, and avoid copying generated content for publication on your website.
  • Whenever possible, write texts with a length of over 5,000 characters without spaces. The Zipf’s Law analysis method is most often used for long reads with a large number of keywords.

It’s better to prepare for writing a content-rich text with quality keyword placement than to edit the article after verification. To do this, determine in advance which words should be used most frequently. They should be relevant to the queries of your target website visitors. Based on this, create lists of words that are semantically related to the main topic.

Semantic models capture synonyms, related words, and semantic frames well. A semantic frame is a set of words that represent perspectives or participants in a certain type of event. For example, a semantic frame like “5 o’clock tea” may include words such as “traditions”, “tea”, “cup”, “teapot”, “spoon”, “sugar”, “beverage”, “brewing”, and so on.

When creating fresh content, it can be helpful to think in terms of semantic frames. In other words, consider the semantic frame you want your page to be ranked for, rather than just a specific keyword.

Source: “How search engines understand human language” by Yauhen Khutarniuk.

What connects Zipf’s Law and Pareto’s Law?

From the perspective of mathematical statistics, Zipf’s Law is a type of Pareto distribution. Italian engineer, economist, and sociologist Vilfredo Pareto formulated the “80/20” rule: 80% of any result requires 20% of the effort, while the remaining 20% of the result may require 80% of the effort.

In marketing and commerce, the Pareto principle is well-known for stating that 80% of profits come from 20% of customers. When it comes to the lexical content of texts, roughly 20% of words in a language account for 80% of their usage.
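The link between the two laws can be illustrated numerically. The sketch below assumes an idealized vocabulary whose word frequencies are exactly proportional to 1 / rank, which real texts only approximate; the function name `top_share` and the vocabulary size of 10,000 are hypothetical choices made for the illustration.

```python
def top_share(vocabulary_size: int, top_fraction: float = 0.2) -> float:
    """Share of all word occurrences covered by the most frequent words,
    assuming frequencies are exactly proportional to 1 / rank."""
    weights = [1 / rank for rank in range(1, vocabulary_size + 1)]
    cutoff = int(vocabulary_size * top_fraction)
    return sum(weights[:cutoff]) / sum(weights)


print(f"{top_share(10_000):.0%}")
```

Under these assumptions the top 20% of words cover a little over 80% of all occurrences. The exact share depends on the vocabulary size, so it should be read as an illustration of the 80/20 tendency rather than a fixed figure.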

In simple terms, a specific group of popular words is used frequently, while the majority of the vocabulary is used very rarely. This does not apply to function words without semantic meaning, as they do not convey the essence of the text.

Words in a text can be divided into three categories

  1. Auxiliary: Parts of speech that are used frequently but are not considered by search engines.
  2. Occasional: These words appear quite often, do not describe the topic of the article, and have minimal impact on SEO.
  3. Important: These words are relatively rare compared to the first two categories but are crucial in search engine identification of the web page’s topic and are perceived by search engines as keywords.

Important words make up about 20% of the text, and they are responsible for 80% of the content and content promotion. They make the text understandable for Google, while the other 80% serve exclusively for readers. It’s important to form semantic frames and choose synonyms for these words.

The significance of Zipf’s Law in copywriting

Zipf’s Law is important for copywriters, marketers, and SEO specialists. It is closely related to the concept of keyword stuffing and, therefore, text quality.

Analyzing an article for compliance with Zipf’s Law can help avoid improper use of keywords.

Although the frequency of important words is crucial for evaluating the text, this regularity does not influence certain aspects of copywriting.

  • The content of a text, especially in the case of literary works, may not always adhere to Zipf’s Law due to the use of artistic devices. However, this deviation from the law does not necessarily diminish the readability and naturalness of the text.
  • The professionalism of the author is crucial in determining the quality of a text. Not all texts that adhere to Zipf’s Law by 50% or more are well-written. Grammatical errors, awkward formulations, or content copied from external sources can negatively impact the text’s reception by search engines and website visitors, even if there are no repetitions.
  • The uniqueness of the article is important. A 100% original text may have low Zipf’s Law scores. Remember that the value of content lies primarily in its usefulness to readers, so it’s essential not to overlook other quality criteria for the sake of achieving the perfect word frequency distribution in the text.
  • Website promotion is a crucial aspect. While keyword stuffing can harm a website’s ranking, articles with low Zipf’s Law scores may still appear on the first page of Google search results. This shows that achieving the perfect Zipf’s distribution is not the sole factor influencing search engine rankings.

The law allows for controlling the number of stop words. It helps ensure a logical sequence of keywords and reduces the repetition that can annoy readers and increase the semantic “tediousness” of the text.

It’s no accident that briefs for copywriters often require the first keyword to be used several times. Typically, it is included in the first paragraph of the text (sometimes also in the last one), in the title, and in the descriptor. A set of six key queries works well. The first keyword is the most effective, however: the second yields about half the results, the third about a third, and so on. The Keyword Efficiency Index (KEI) is related to the fundamental principle of Zipf’s Law.


Conclusions

Zipf’s linguistic law is a regularity stating that the product of a word’s rank in a frequency dictionary and the word’s frequency in speech and writing is approximately constant.

This rule extends to various areas of human activity, including economic and sociological phenomena. In linguistics, this law found application in the 20th century. According to it, the most popular word in a text or language as a whole occurs twice as often as the second most popular word, three times as often as the third, and so on.

For modern texts, this regularity plays an important role as it helps copywriters, marketers, and SEO specialists create readable, interesting, and unique articles written in plain language.

Unlike an SEO strategy that relies on using as many keywords as possible, writing articles according to Zipf’s Law involves finding and using the most popular and relevant keywords for the page’s theme.
