DEFINITION: archive
(Noun) A collection of historical documents or records providing information about a place, institution, or group of people.
(Verb) Place or store (something) in an archive.
Recently, I have been preoccupied with the idea of digital archives. While there are many debates around digital archives — from their superiority or inferiority to hardcopy archives, to the reliability of the preservation methods — I have opted to think more about how the origin or representative members of an archiving organisation affect the shape of the archive built.
I wish I were at a point to do some kind of cross-sectional study, looking at various digital archives from Africa to Europe to the Americas to Asia. But, until then, I thought it was time to at least begin thinking a bit more deeply about archives.
In this blog post, I will talk about what moved me to consider the role of digital archives. Additionally, I will talk about my developing data visualisation project (a work-still-very-much-in-progress), followed by some mercurial thoughts on archiving.
On 7 December 2023, the Library of Great Omari Mosque in Gaza was bombed. As one of the oldest mosques in Gaza, it held rare manuscripts concerning Islam, sciences, and poetry. Centuries of sieges, wars, pillaging, and occupation have seen the collection greatly diminished, and the bombing, now almost six months ago, nearly destroyed it. The destruction of this institution in the centre of Gaza illustrates how the cruelty being committed here spans generations: not only are millions of people witnessing their city become unlivable, generations of people’s work and care are being callously wiped out.
I don’t know how to comment from the comfort of my dorm room that Palestinians and Palestinian history will surely prevail: their despair cannot be described or translated, and their hope cannot be excised and re-experienced.
Here, I am thinking out loud about preservation and archiving. I will speak about a data visualisation project, Data and Literature, that I am (slowly) working on, using Project Gutenberg to give the discussion some focus.
Data and Literature, my data visualisation project, will visualise the nationalities of the authors in Project Gutenberg, a volunteer-run digital library that hosts public domain books and documents. I specifically chose Project Gutenberg because it states that it archives books whose copyright has expired in the United States, which I thought would surely skew the focus of the literature in the archive.
I have broken this project script into four distinct parts: (i) data collection; (ii) data cleaning; (iii) data mapping; and (iv) data visualisation. I will carry out these four steps in Python and R, trying to play to each language’s strengths in parsing and data organisation. For example, where Python has superior libraries for web scraping, R has more specific tools for processing and organising raw data. It has been both exciting and challenging to find comprehensive solutions in these languages.
Fortunately, I stand on the shoulders of giants: there is a Project Gutenberg R package, gutenbergr, which simplified the author metadata collection. However, it does not collect author nationality, so I collected that on my own by scraping Wikipedia, using Python’s requests and BeautifulSoup libraries. And then, once I am happy with the state of the data (Wikipedia articles aren’t as standardised as I assumed), I will use R to do some (hopefully) insightful visualisations. Thus, all the challenges in creating this project have led me to find solutions that can be used for much bigger projects. Discovering how many resources I have at my disposal made me more conscious about establishing a firm scope for my project.
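For readers curious what the scraping step might look like, here is a minimal sketch, not the project's actual script. It assumes the nationality sits in a row of the article's infobox table, which is only sometimes true and is part of why the data is less standardised than hoped. The function name infobox_field and the sample HTML are hypothetical, written for illustration only.

```python
from bs4 import BeautifulSoup


def infobox_field(html, label):
    """Return the text of the infobox row whose header matches `label`, or None."""
    soup = BeautifulSoup(html, "html.parser")
    infobox = soup.find("table", class_="infobox")
    if infobox is None:
        return None  # assumption: no infobox means nothing to extract here
    for row in infobox.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header is None or value is None:
            continue  # skip title or image rows that lack a header/value pair
        if header.get_text(" ", strip=True).lower() == label.lower():
            return value.get_text(" ", strip=True)
    return None  # the infobox exists but has no such row


# Hypothetical, hand-written fragment standing in for a downloaded article.
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th>Born</th><td>16 December 1775</td></tr>
  <tr><th>Nationality</th><td>  English </td></tr>
</table>
"""

nationality = infobox_field(SAMPLE_HTML, "Nationality")
missing = infobox_field("<p>No infobox here.</p>", "Nationality")
```

In practice the HTML would come from something like requests.get(url, timeout=10).text rather than a hard-coded string. Returning None for missing rows keeps the gaps visible, so they can be counted and handled during cleaning instead of being silently guessed.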
Data and Literature is also an opportunity for me to think more independently about the research and data analysis process. In the job I worked before my studies, data analysis and organisation were a major part of my day-to-day tasks, and I especially enjoyed working with my supervisor and seeing his process, from collection to analysis. So, this project is me somewhat formally taking the skills I learnt there and turning them into something of my own. When I have finished the project, I’ll include a more detailed write-up here and publish the scripts on GitHub, for those curious.
In 2021, 211 manuscripts from the Library of Great Omari Mosque were digitally preserved in an archiving project organised by the Hill Museum & Manuscript Library. It is difficult to compare this archive with all that has been lost. However, I have come to believe that the comparison is a non-starter; there is something more important to realise: there were and always will be people who care so much that they do what seems like an insubstantial action against evil, but they have left us with something that was intended to be lost.
I wrote “suppose we preserve for a future we will never see.” I hope, very much, that supposition is wrong. I hope more than anything that I live long enough to hear that some ambitious child who has come out of the rubble that has been made of her city is working to restore her library. That she has gone over digital records and will leave handwritten records of her own. I hope, I hope, I hope she archives, again.
Recommendation: Cloud Atlas, both the novel by David Mitchell and the film directed by the Wachowski sisters and Tom Tykwer.