Natural Language Processing for Historical Texts
More and more historical texts are becoming available in digital form. Digitization of paper documents is motivated by the aim of preserving cultural heritage and making it more accessible, both to laypeople and scholars. As digital images cannot be searched for text, digitization projects increasingly strive to create digital text, which can be searched and otherwise automatically processed, in addition to facsimiles. Indeed, the emerging field of digital humanities heavily relies on the availability of digital text for its studies.
Together with the increasing availability of historical texts in digital form, there is a growing interest in applying natural language processing (NLP) methods and tools to historical texts. However, the specific linguistic properties of historical texts—the lack of standardized orthography, in particular—pose special challenges for NLP.
This book aims to give an introduction to NLP for historical texts and an overview of the state of the art in this field. The book starts with an overview of methods for the acquisition of historical texts (scanning and OCR), discusses text encoding and annotation schemes, and presents examples of corpora of historical texts in a variety of languages. The book then discusses specific methods, such as creating part-of-speech taggers for historical languages or handling spelling variation. A final chapter analyzes the relationship between NLP and the digital humanities.
Certain recently emerging textual genres, such as SMS, social media, and chat messages, or newsgroup and forum postings, share a number of properties with historical texts, for example, nonstandard orthography and grammar and profuse use of abbreviations. The methods and techniques required for the effective processing of historical texts are thus also of interest for research in other domains.
Natural Language Processing for Historical Texts. Synthesis Lectures on Human Language Technologies #17. Morgan & Claypool, San Rafael, CA, USA, 2012. DOI: 10.2200/S00436ED1V01Y201207HLT017
Publications
Full list of publications on ORCID.
My profiles on Google Scholar and Microsoft Academic Search.
Recent publications (since 2017)
Peer-reviewed articles in international scientific journals
- Michael Piotrowski and Max Kemman (2023). “Institutional Arrangements in the Absence of Disciplinary Definitions: Digital Humanities in Switzerland”. In: Swiss Journal of Sociology 49.3, pp. 519–540. DOI: 10.2478/sjs-2023-0025
- Michael Piotrowski and Aris Xanthos (2020). “Décomposer les humanités numériques”. In: Humanités numériques 1. DOI: 10.4000/revuehn.381
- Michael Piotrowski (2019). “Historical models and serial sources”. In: Journal of European Periodical Studies 4.1, pp. 8–18. DOI: 10.21825/jeps.v4i1.10226
- Michael Piotrowski (2019). “Accepting and modeling uncertainty”. In: Zeitschrift für digitale Geisteswissenschaften (Sonderband 4): Die Modellierung des Zweifels. Schlüsselideen und -konzepte zur graphbasierten Modellierung von Unsicherheiten. Ed. by Andreas Kuczera, Thorsten Wübbena, and Thomas Kollatz. DOI: 10.17175/sb004_006a
- Michail Maiatsky, Alexey Boyarsky, Natalia Boyarskaya, Ekaterina Velmezova, and Michael Piotrowski (2018). “VicoGlossia: annotatable and commentable library as a bridge between reader and scholar (a proof of concept study: early Soviet philological culture)”. In: Umanistica Digitale 2.2. DOI: 10.6092/issn.2532-8816/7253
Peer-reviewed conference proceedings
- Emily Öhman, Michael Piotrowski, and Mika Hämäläinen (2023). “The Great Digital Humanities Disconnect: The Failure of DH Publishing”. In: Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pp. 132–137. Association for Computational Linguistics, Stroudsburg, PA. URL: https://aclanthology.org/2023.nlp4dh-1.16
- Michael Piotrowski (2023). “Incertitude et histoire numérique”. In: Actes du colloque Humanistica 2023. Colloque Humanistica 2023 (Genève, June 26–28, 2023). Association francophone des humanités numériques. URL: https://hal.science/hal-04108242
- Cerstin Mahlow and Michael Piotrowski (2022). “Academic writing and publishing beyond documents”. In: Proceedings of the 22nd ACM Symposium on Document Engineering. DOI: 10.1145/3558100.3563840
- Mateusz Fafinski and Michael Piotrowski (2020). “Modelling Medieval Vagueness. Towards a Methodology of Visualising Geographical Uncertainty in Historical Texts”. In: Ralf H. Reussner, Anne Koziolek, Robert Heinrich (eds.): INFORMATIK 2020: 50. Jahrestagung der Gesellschaft für Informatik; 3. Workshop InfDH 2020 “Methoden und Anwendungen der Computational Humanities”. Lecture Notes in Informatics (LNI), Gesellschaft für Informatik, Bonn, pp. 1317–1326. DOI: 10.18420/inf2020_123
- Michael Piotrowski and Mateusz Fafinski (2020). “Nothing new under the sun? Computational humanities and the methodology of history”. In: CHR2020: Proceedings of the Workshop on Computational Humanities Research (Amsterdam, Nov. 18–20, 2020). CEUR Workshop Proceedings, pp. 171–181. URL: http://ceur-ws.org/Vol-2723/short16.pdf
- Michael Piotrowski and Markus Neuwirth (2020). “Prospects for computational hermeneutics”. In: Atti del IX Convegno Annuale AIUCD. La svolta inevitabile: sfide e prospettive per l’informatica umanistica (Milan, Jan. 15–17, 2020). Ed. by Cristina Marras, Marco Passarotti, Greta Franzini, and Eleonora Litta. Associazione per l’Informatica Umanistica e la Cultura Digitale (AIUCD), pp. 204–209. DOI: 10.6092/UNIBO/AMSACTA/6316
- Michael Piotrowski (2019). “A Vision for User-Defined Semantic Markup”. In: Proceedings of the ACM Symposium on Document Engineering 2019 (DocEng '19). Association for Computing Machinery, New York, NY, USA, Article 28, pp. 1–4. DOI: 10.1145/3342558.3345414. Available from the ACM Digital Library.
- Michael Piotrowski (2019). “History and the future of markup”. In: Proceedings of XML Prague 2019 (Prague, Feb. 7–9, 2019). Ed. by Jiří Kosek, pp. 323–333. URL: http://archive.xmlprague.cz/2019/files/xmlprague-2019-proceedings.pdf#page=335
- Michael Piotrowski (2018). “Digital humanities: an explication”. In: Proceedings of InfDH 2018 (Sept. 25, 2018). Ed. by Manuel Burghardt and Claudia Müller-Birn. Gesellschaft für Informatik. Berlin. DOI: 10.18420/infdh2018-07
- Hatem Mousselly Sergieh, Michael Piotrowski, and Iryna Gurevych (Aug. 2017). “EGOlink: supporting editors of online historical sources through automatic link discovery”. In: Proceedings of Digital Humanities 2017 (DH 2017). ADHO. Montréal, Canada, pp. 758–761. URL: https://dh2017.adho.org/abstracts/163/163.pdf
Edited proceedings
- Michael Piotrowski, ed. (2018). Proceedings of the Workshop on Computational Methods in the Humanities (COMHUM 2018). Workshop on Computational Methods in the Humanities 2018 (Lausanne, June 4–5, 2018). CEUR Workshop Proceedings. URL: http://ceur-ws.org/Vol-2314/
- Michael Piotrowski, ed. (2018). COMHUM 2018: book of abstracts for the Workshop on Computational Methods in the Humanities 2018. Workshop on Computational Methods in the Humanities 2018 (Lausanne, June 4–5, 2018). DOI: 10.5281/zenodo.1312778
Outreach (not peer-reviewed)
- Michael Piotrowski (2021). “Qu’est-ce que le numérique a changé dans notre rapport à la recherche ?”. In: Journée de la recherche en Lettres 2021. La recherche et le numérique. Ed. by Ekaterina Velmezova. Université de Lausanne, Faculté des lettres. URL: https://www.unil.ch/files/live/sites/lettres/files/shared/Faculte/rdv-annuels/recherche/2021/Journee-Recherche-Lettres-2021-Michael-Piotrowski.pdf
- Michael Piotrowski (2020). “Modèles de mobilité des musiciens et migration des motifs musicaux”. In: Journée de la recherche en Lettres 2020. La recherche et la mobilité. Ed. by Ekaterina Velmezova. Université de Lausanne, Faculté des lettres. URL: https://www.unil.ch/lettres/files/live/sites/lettres/files/shared/Faculte/rdv-annuels/recherche/2020/08-Michael-Piotrowski.pdf
- Michael Piotrowski (2019). “La technologie et l’interdisciplinarité”. In: Journée de la recherche en Lettres 2019. Interdisciplinarité, pluridisciplinarité, multidisciplinarité, transdisciplinarité dans le monde académique d’aujourd’hui : avantage ou obstacle ? Journée de la recherche en Lettres 2019 (Lausanne, Mar. 15, 2019). Ed. by Ekaterina Velmezova. Université de Lausanne, Faculté des lettres. URL: https://www.unil.ch/lettres/files/live/sites/lettres/files/shared/Faculte/rdv-annuels/recherche/2019/06-Michael-Piotrowski.pdf
- Michael Piotrowski (2018). “Digital Humanities – zwischen Metawissenschaft und neuer Disziplin”. In: SocietyByte. Wissenschaftsmagazin des BFH-Zentrums Digital Society, Dec. 2018. URL: https://www.societybyte.swiss/2018/12/05/was-heisst-und-zu-welchem-ende-studiert-man-digital-humanities/
Preprints and Working Papers
- Michael Piotrowski (2023). Uncertainty as Unavoidable Good. Center for Uncertainty Studies Working Papers 5. Bielefeld: Center for Uncertainty Studies (CeUS). DOI: 10.4119/unibi/2983506
- Michael Piotrowski (2019). “Ain’t no way around it. Why we need to be clear about what we mean by ‘Digital Humanities’”. In: Wozu Digitale Geisteswissenschaften? Innovationen, Revisionen, Binnenkonflikte. Symposienreihe Digitalität in den Geisteswissenschaften: «5. Wozu Digitale Geisteswissenschaften? Innovationen, Revisionen, Binnenkonflikte» (Lüneburg, Nov. 20–22, 2019). Ed. by Martin Huber, Sybille Krämer, and Claus Pias. DOI: 10.31235/osf.io/d2kb6. Submitted.