The Critical Window of Shadow Libraries
At Anna’s Archive, we are often asked how we can claim to preserve our collections in
perpetuity, when the total size is already approaching 1 Petabyte (1000 TB), and is still
growing. In this article we’ll look at our philosophy, and see why the next decade is critical for
our mission of preserving humanity’s knowledge and culture.
The total size of our collections, over the last few months, broken down by number of torrent seeders.
Priorities
Why do we care so much about papers and books? Let’s set aside our fundamental belief in
preservation in general — we might write another post about that. So why papers and books
specifically? The answer is simple: information density.
Per megabyte of storage, written text stores the most information out of all media. While we
care about both knowledge and culture, we do care more about the former. Overall, we find a
hierarchy of information density and importance of preservation that looks roughly like this:
The ranking in this list is somewhat arbitrary — several items are ties or have disagreements
within our team — and we’re probably forgetting some important categories. But this is
roughly how we prioritize.
Some of these items are too different from the others for us to worry about (or are already
taken care of by other institutions), such as organic data or geographic data. But most of the
items in this list are actually important to us.
Another big factor in our prioritization is how much at risk a certain work is. We prefer to
focus on works that are:
- Rare
- Uniquely underfocused
- Uniquely at risk of destruction (e.g. by war, funding cuts, lawsuits, or political persecution)
Finally, we care about scale. We have limited time and money, so we’d rather spend a month
saving 10,000 books than 1,000 books — if they’re about equally valuable and at risk.
Shadow libraries
There are many organizations that have similar missions, and similar priorities. Indeed, there
are libraries, archives, labs, museums, and other institutions tasked with preservation of this
kind. Many of those are well-funded, by governments, individuals, or corporations. But they
have one massive blind spot: the legal system.
Herein lies the unique role of shadow libraries, and the reason Anna’s Archive exists. We can
do things that other institutions are not allowed to do. Now, it’s not (often) that we can
archive materials that are illegal to preserve elsewhere. No, it’s legal in many places to build
an archive with any books, papers, magazines, and so on.
But what legal archives often lack is redundancy and longevity. There exist books of which
only one copy exists in some physical library somewhere. There exist metadata records
guarded by a single corporation. There exist newspapers only preserved on microfilm in a
single archive. Libraries can get funding cuts, corporations can go bankrupt, archives can be
bombed and burned to the ground. This is not hypothetical — this happens all the time.
The thing we can uniquely do at Anna’s Archive is store many copies of works, at scale. We
can collect papers, books, magazines, and more, and distribute them in bulk. We currently do
this through torrents, but the exact technologies don’t matter and will change over time. The
important part is getting many copies distributed across the world. This quote from over 200
years ago still rings true:
“The lost cannot be recovered; but let us save what remains: not by vaults and locks which
fence them from the public eye and use, in consigning them to the waste of time, but by
such a multiplication of copies, as shall place them beyond the reach of
accident.” — Thomas Jefferson, 1791
A quick note about the public domain. Since Anna’s Archive uniquely focuses on activities that are
illegal in many places around the world, we don’t bother with widely available collections,
such as public domain books. Legal entities often already take good care of that. However,
there are considerations which make us sometimes work on publicly available collections:
- Metadata records can be freely viewed on the WorldCat website, but not downloaded in bulk (until we scraped them)
- Code can be open source on GitHub, but GitHub as a whole cannot be easily mirrored and thus preserved (though in this particular case there are sufficiently distributed copies of most code repositories)
- Reddit is free to use, but has recently put up stringent anti-scraping measures, in the wake of data-hungry LLM training (more about that later)
A multiplication of copies
Back to our original question: how can we claim to preserve our collections in perpetuity? The
main problem here is that our collection has been growing at a rapid clip, by scraping and
open-sourcing some massive collections (on top of the amazing work already done by other
open-data shadow libraries like Sci-Hub and Library Genesis).
This growth in data makes it harder for the collections to be mirrored around the world. Data
storage is expensive! But we are optimistic, especially when observing the following three
trends.
The first trend follows directly from our priorities discussed above. We prefer to work on liberating
large collections first. Now that we’ve secured some of the largest collections in the world, we
expect our growth to be much slower.
There is still a long tail of smaller collections, and new books get scanned or published every
day, but the rate will likely be much slower. We might still double or even triple in size, but
over a longer time period.
The second trend is the falling cost of storage. As of the time of writing, disk prices per TB are around $12 for new disks, $8 for used disks,
and $4 for tape. If we’re conservative and look only at new disks, that means that storing a
petabyte costs about $12,000. If we assume our library will triple from 900TB to 2.7PB, that
would mean $32,400 to mirror our entire library. Adding electricity, cost of other hardware,
and so on, let’s round it up to $40,000. Or with tape more like $15,000–$20,000.
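As a rough sanity check on these numbers, here is a back-of-the-envelope calculation in Python. The 1.2x overhead multiplier for electricity and other hardware is our own illustrative assumption, not a figure from any source.

```python
# Back-of-the-envelope cost to mirror the collection, using the per-TB prices above.
# The 1.2x overhead multiplier (electricity, enclosures, spare drives) is an assumption.
PRICE_PER_TB = {"new_hdd": 12, "used_hdd": 8, "tape": 4}  # USD, at time of writing

def mirror_cost(collection_tb: float, medium: str, overhead: float = 1.2) -> float:
    """Cost in USD to store one full copy of the collection on the given medium."""
    return collection_tb * PRICE_PER_TB[medium] * overhead

projected_tb = 900 * 3  # if the library triples to 2.7 PB
for medium in PRICE_PER_TB:
    print(f"{medium}: ${mirror_cost(projected_tb, medium):,.0f}")
# new_hdd: $38,880 / used_hdd: $25,920 / tape: $12,960
```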
On one hand, $15,000–$40,000 for the sum of all human knowledge is a steal. On the other hand, it is a bit steep to expect many people to keep full copies, especially if we’d also like them to keep seeding their torrents for the benefit of others.
Hard drive costs per TB have fallen to roughly a third of what they were over the last 10 years, and will likely continue to drop at a similar pace. Tape appears to be on a similar trajectory. SSD prices are dropping even faster, and might fall below HDD prices by the end of the decade.
Chart: Historical Cost of Computer Memory and Storage, in US dollars per terabyte (not adjusted for inflation).
If this holds, then in 10 years we might be looking at only $5,000–$13,000 (a third of today’s cost) to mirror our entire collection, or even less if we grow less in size. While still a lot of money, this will be attainable for many people. And it might be even better because of the next point…
The third trend is improving information density. We currently store books in the raw formats in which they are given to us. Sure, they are
compressed, but often they are still large scans or photographs of pages.
Until now, the only options for shrinking the total size of our collection have been more aggressive compression or deduplication. However, to get significant enough savings, both are too lossy for our taste. Heavy compression of photos can make text barely readable. And deduplication requires high confidence that two files contain exactly the same book, which is hard to establish, especially when the contents are the same but the scans were made on different occasions.
There has always been a third option, but its quality has been so abysmal that we never
considered it: OCR, or Optical Character Recognition. This is the process of converting
photos into plain text, by using AI to detect the characters in the photos. Tools for this have
long existed, and have been pretty decent, but “pretty decent” is not enough for preservation
purposes.
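To make the idea concrete, here is a minimal sketch of classical OCR using the open-source Tesseract engine via the pytesseract package. This is purely illustrative: the file name is hypothetical, this is not the pipeline Anna’s Archive uses, and engines like Tesseract are exactly the “pretty decent” tier described above.

```python
# Minimal OCR sketch using the open-source Tesseract engine (via pytesseract).
# Illustration only -- not the tooling Anna's Archive actually uses.
from PIL import Image
import pytesseract

def ocr_page(path: str, lang: str = "eng") -> str:
    """Convert a single scanned page image into plain text."""
    return pytesseract.image_to_string(Image.open(path), lang=lang)

if __name__ == "__main__":
    print(ocr_page("page_001.png"))  # "page_001.png" is a hypothetical scan file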
However, recent multi-modal deep-learning models have made extremely rapid progress,
though still at high costs. We expect both accuracy and costs to improve dramatically in
coming years, to the point where it will become realistic to apply to our entire library.
OCR improvements.
When that happens, we will likely still preserve the original files, but in addition we could
have a much smaller version of our library that most people will want to mirror. The kicker is
that raw text itself compresses even better, and is much easier to deduplicate, giving us even
more savings.
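A quick sketch of why that matters: plain text compresses far better than page scans, and once text is normalized it can be fingerprinted for deduplication with a simple content hash. The normalization below is deliberately naive and only illustrative, not a real dedup scheme.

```python
# Illustrative only: plain text compresses well and is easy to fingerprint for dedup.
import hashlib
import zlib

def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), level=9))

def dedup_key(text: str) -> str:
    """Naive fingerprint: collapse whitespace and case, then hash the result."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```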
Overall it’s not unrealistic to expect at least a 5-10x reduction in total file size, perhaps even
more. Even with a conservative 5x reduction, we’d be looking at $1,000–$3,000 in 10 years
even if our library triples in size.
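The arithmetic behind that ballpark, under the stated assumptions (the library triples, prices fall to a third, OCR gives a 5x reduction), is a hypothetical sketch along these lines:

```python
# Hypothetical 10-year projection; all inputs are the assumptions from the text above.
def projected_cost(tb_today=900, growth=3, ocr_reduction=5, price_per_tb=12, price_drop=1/3):
    future_tb = tb_today * growth / ocr_reduction   # 540 TB after a 5x OCR reduction
    return future_tb * price_per_tb * price_drop

print(f"new HDD: ${projected_cost():,.0f}")               # ~$2,160
print(f"tape:    ${projected_cost(price_per_tb=4):,.0f}")  # ~$720
```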
Critical window
If these forecasts are accurate, we just need to wait a couple of years before our entire
collection will be widely mirrored. Thus, in the words of Thomas Jefferson, “placed beyond
the reach of accident.”
Unfortunately, the advent of LLMs, and their data-hungry training, has put a lot of copyright
holders on the defensive. Even more than they already were. Many websites are making it
harder to scrape and archive, lawsuits are flying around, and all the while physical libraries
and archives continue to be neglected.
We can only expect these trends to continue to worsen, and many works to be lost well
before they enter the public domain.
We are on the eve of a revolution in preservation, but “the lost cannot be recovered.”
We have a critical window of about 5-10 years during which it’s still fairly expensive to
operate a shadow library and create many mirrors around the world, and during which access
has not been completely shut down yet.
If we can bridge this window, then we’ll indeed have preserved humanity’s knowledge and
culture in perpetuity. We should not let this time go to waste. We should not let this critical
window close on us.
Let’s go.