Dativ ist dem Genitiv sein Tod: Story of a Bloody Murder in SpaCy and Numpy

The German genitive is, roughly, the case that marks possession; English expresses it with the apostrophe (’s) and the possessive “of”. Two quick examples:

Die Nachrichten des Tages          news of the day
Der Beruf des Mannes ist Arzt.     The profession of the man is doctor. / The man's profession is doctor.

The genitive of possession and relationship occurs frequently in formal standard German. For instance, the following sentence about private customer loans is from the Deutsche Bank customer service website:

Sie ist befugt, den Darlehensnehmer bei der Beantragung des Darlehens zu 

In everyday spoken German, the story is completely different: von + Dative often replaces the Genitive. See the example:

Das ist der Hund des Mannes.  That's the man's dog.
Das ist der Hund vom Mann.    That's the dog from the man. -> That's the dog of the man.

In this post, we’ll carry out an empirical examination of whether the Dative has really replaced the Genitive. We’ll investigate both the nominal genitive and the prepositional genitive, i.e.

  • Is von + Dative replacing Genitive?
  • What’s up with the prepositions that are supposed to be used with the genitive? I often see them incorrectly used with the Dative, even in informal written language.

Basically what we’ll do is:

  • Collect samples
  • Look for all possessive constructions and count how many of them are real genitives and how many are dative replacements
  • Look for genitive prepositions and count how many of them are used correctly
  • Count occurrences of dessen, deren and wessen (see the explanations below)

First, let’s see the genitive case in detail, then we can jump onto our dataset and do the hypothesis testing!

Genitive in NP

In grammar, the genitive is the grammatical case that marks a noun as modifying another noun. The genitive also has prepositional and adverbial uses. In general, a noun phrase (NP) in German follows this word order:

Article, Number, Adjective(s), Noun, Genitive attribute, Position(s), Relative Clause, Reflexive Pronoun

which is significantly longer than the English counterpart. Of course, typical noun phrases are not this long; they usually have the form Article + Adjective? + Noun (so don’t be scared and run away :blush:). The decline of the genitive attribute might partly be due to this as well: speakers avoid long noun phrases, because the longer a phrase gets, the harder it is to understand. The genitive in an NP, i.e. the nominal genitive, marks possession:

Der Beruf des alten Mannes     the profession of the old man
Des alten Mannes Beruf         the old man's profession

and is replaced by the von + dative combo in everyday spoken language:

der Hund meines Bruders  ->  der Hund von meinem Bruder

Murderer #1 is this usage of the Dative. Though this construction is not incorrect, it is considered “not very elegant”. However, educated or not, many people use this sort of construction in spoken language and in informal written communication such as texting, emailing and blogging. That’s how a language loses one of its cases, in my opinion.

Prepositional Genitive

In formal standard German, the object of a genitive preposition takes the genitive case. Some common genitive prepositions are

jenseits      on the other side of
anlässlich    on the occasion of
kraft         by virtue of
anstelle      in place of
laut          according to
aufgrund      on the basis of
seitens       on the part of
außerhalb     outside of
trotz         despite, in spite of
bezüglich     with regard to
während       during
innerhalb     within
wegen         because of

with some examples:

innerhalb eines Tages          within a day
statt des Hemdes               instead of the shirt
während unserer Abwesenheit    during our absence

In my opinion, murderer #2 is the incorrect use of the dative after these prepositions. For instance, even in written language I see a lot of wegen dem instead of the correct wegen des. Unlike the von + dative replacement, this usage is wrong.

Personal, Relative, Interrogative and Demonstrative Pronouns

Wessen literally means whose. Here are some demonstrative pronouns:

desjenigen
desselben

relative pronouns:

derer
deren
dessen

and personal pronouns:

meines
deines
seines
ihres
unseres
eures
ihres

There are also the adverbial genitive and some verbs that take the genitive, but their use is quite limited. Hence in our empirical study we’ll count only the easily distinguishable cases (we’ll be über-tricky :wink:).

The Dataset

We’ll dig into the Yelp restaurant reviews dataset; reviews of places in Germany are generally written in German. The reviews are written in informal language, hence ideal for our empirical study :wink: Go ahead and download the famous Yelp dataset. I pulled a little stunt to find the German cities and flushed the list to city_list.txt. The reviewed restaurants are in:


Denkendorf
Freyburg
Filderstadt
Ditzingen
Waiblingen
Neuhausen
Henderson
Fellbach
Ludwigsburg
Esslingen am Neckar
Esslingen
Stuttgart
Leonberg
Boblingen
Sindelfingen
Stuttgart-Vaihingen
Stuttgart - Bad Cannstatt
Ostfildern
Gerlingen
Ludwigsburg
Leinfelden-Echterdingen
Kornwestheim
Schwaikheim
Remseck am Neckar
Remseck

Then we’re ready to find the business_ids of restaurants in German cities with find_id_from_city.sh and write them into ids.txt. 7044 places from Germany made it into the Yelp dataset:

$ wc -l ids.txt
7044 ids.txt
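The script find_id_from_city.sh isn’t shown in this post, so here is only a rough Python sketch of what such a step could look like. It assumes the business file holds one JSON object per line with a city field next to business_id; the function name collect_business_ids and the toy records are made up for illustration.

```python
import json


def collect_business_ids(business_lines, cities):
    """Return the business_id of every record whose city is in `cities`.

    business_lines: iterable of JSON strings, one business per line
    (an assumed layout, not confirmed by the post).
    """
    wanted = {city.strip() for city in cities if city.strip()}
    ids = []
    for line in business_lines:
        line = line.strip()
        if not line:
            continue
        business = json.loads(line)
        if business.get("city") in wanted:
            ids.append(business["business_id"])
    return ids


# Toy records standing in for the real business file.
sample_businesses = [
    '{"business_id": "a1", "city": "Stuttgart"}',
    '{"business_id": "b2", "city": "Las Vegas"}',
    '{"business_id": "c3", "city": "Esslingen"}',
]
print(collect_business_ids(sample_businesses, ["Stuttgart", "Esslingen\n"]))
```

With the real data, the lines of city_list.txt would be passed as cities and the returned ids written to ids.txt, one per line.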

In order to select the German restaurant reviews from review.json, I play a bit with jq instead of benefitting from the chunk-reading talents of Pandas. Obviously such a huge JSON file can’t be read into memory at once; one has to iterate in chunks. However, as a text miner I play with jq a lot, so here I decided to filter the German reviews first and then read them into Python. Surely Pandas provides nice methods for iterating in chunks, but remember there’s always more than one way to swim a fish :wink: The following lines select the lines of review.json whose business_id is in ids.txt:

$ jq -R . ids.txt > ids.json
$ jq -c --slurpfile ids ids.json 'select(.business_id as $id | any($ids[]; $id == .))' review.json > german_reviews.json
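For those who’d rather stay in Python, the same selection can be sketched as a line-by-line stream, so the big file never has to fit into memory. This is an alternative to the jq call, not what I actually ran; it assumes review.json holds one JSON object per line, and filter_reviews is a made-up name.

```python
import json


def filter_reviews(review_lines, business_ids):
    """Yield the review lines whose business_id is in `business_ids`."""
    wanted = set(business_ids)  # set lookup keeps the per-line check cheap
    for line in review_lines:
        line = line.strip()
        if line and json.loads(line).get("business_id") in wanted:
            yield line


# Toy input; with the real data, pass an open file handle of review.json,
# which is then read one line at a time instead of all at once.
sample_reviews = [
    '{"business_id": "a1", "text": "Sehr lecker!"}',
    '{"business_id": "zz", "text": "Great burgers."}',
]
for kept in filter_reviews(sample_reviews, ["a1", "c3"]):
    print(kept)
```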

Note that there are also English reviews for German restaurants, mainly by expats. We’ll use a small trick and keep only the reviews that contain at least one of the words ich, Sie, und, aber, oder, bin, habe, kann, sind, hatte, gern, gerne, viele, nicht, kein, keine, mehr, vieles, ein, eine, sehr, muss, die, der, das, ja. Roughly 99% of written German text includes at least one of these words, which are frequent personal pronouns, modal and auxiliary verbs, adverbs and articles. :wink:

$ egrep -i "\b(ich|Sie|und|aber|oder|bin|habe|kann|sind|hatte|gern|gerne|viele|nicht|kein|keine|mehr|vieles|ein|eine|sehr|muss|die|der|das|ja)\b" german_reviews.json > german_only.json
$ mv german_only.json german_reviews.json
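The same marker-word heuristic can be sketched in Python, which also makes its weak spot easy to see. The names GERMAN_HINTS and looks_german are my own for this illustration; the word list is the one from the paragraph above.

```python
import re

# The marker words listed in the post.
GERMAN_HINTS = [
    "ich", "Sie", "und", "aber", "oder", "bin", "habe", "kann", "sind",
    "hatte", "gern", "gerne", "viele", "nicht", "kein", "keine", "mehr",
    "vieles", "ein", "eine", "sehr", "muss", "die", "der", "das", "ja",
]
HINT_RE = re.compile(r"\b(?:" + "|".join(GERMAN_HINTS) + r")\b", re.IGNORECASE)


def looks_german(text):
    """True if the text contains at least one of the marker words."""
    return HINT_RE.search(text) is not None


print(looks_german("Das Essen war sehr gut."))
print(looks_german("Great food and friendly staff."))
print(looks_german("Burgers to die for!"))  # false positive: English "die"
```

The last call shows the limit of the trick: an English review that happens to contain "die" still passes, so the filter is a rough sieve rather than real language detection.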

There are 32564 reviews in total for the 7044 businesses, it seems:

$ wc -l german_reviews.json
32564 german_reviews.json

After preparing the corpus, we’re ready to move on to the counting parts. We keep only the text; we don’t need the other fields such as stars or user id.

$ jq .text german_reviews.json > german_reviews.txt

#1 Murderer: Nominal Genitive Replacements

As I wrote previously, I suspect this is the most common way of avoiding the genitive. Basically we’ll

  • count all definite articles der, die, das, des, dem, den and see the percentage of des
  • count all possessive noun phrases and see how many of them use the genitive.

For the first task, we iterate over all reviews and count tokens with ART tag. ART includes both definite and indefinite articles, so we need to filter the results a bit.

from __future__ import unicode_literals

import codecs
from collections import Counter
import spacy

# "de" is the model shortcut of the spaCy release used here; newer releases
# expect a full model name such as "de_core_news_sm".
nlp = spacy.load("de")

def_arts_list = ["die", "der", "das", "den", "dem", "des"]

counter = Counter()

with codecs.open("german_reviews.txt", "r", encoding="utf-8") as f:
    for line in f:
        review = nlp(line.strip())
        # ART covers definite and indefinite articles, so keep only the definite ones
        def_arts = [t.text.lower() for t in review if t.tag_ == "ART" and t.text.lower() in def_arts_list]
        counter.update(def_arts)

print(counter)

Here is the result:

Counter({u'die': 69698, u'der': 55810, u'das': 38734, u'den': 22346, u'dem': 12347, u'des': 6435})

It doesn’t look that bad, indeed. Des occurs about half as often as dem, which is not that bad. There are 205370 definite articles in 32564 reviews, i.e. about 6 articles per review, and about 3% of all articles are des.
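The arithmetic behind these figures is easy to recheck. The snippet below only reuses the counts printed above and the review count from wc -l; the variable names are mine.

```python
from collections import Counter

# Counts copied from the Counter output above.
article_counts = Counter({
    "die": 69698, "der": 55810, "das": 38734,
    "den": 22346, "dem": 12347, "des": 6435,
})
n_reviews = 32564  # number of lines in german_reviews.json

total = sum(article_counts.values())
per_review = total / n_reviews
des_share = article_counts["des"] / total
des_to_dem = article_counts["des"] / article_counts["dem"]

print("total definite articles:", total)
print("articles per review: {:.1f}".format(per_review))
print("share of des: {:.1%}".format(des_share))
print("des relative to dem: {:.2f}".format(des_to_dem))
```

This gives 205370 articles, about 6.3 per review, a des share of about 3.1%, and des at roughly 0.52 times the count of dem.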

import numpy as np
import matplotlib.pyplot as plt

N = len(counter)
x = np.arange(1, N + 1)
y = [num for (s, num) in counter.items()]
labels = [s for (s, num) in counter.items()]

width = 0.35  # use 1 to make it look like a histogram
# center the bars on x so the tick labels line up under them
bar1 = plt.bar(x, y, width, color="y", align="center")
plt.ylabel("Number of Occurrences")
plt.xticks(x, labels)
plt.show()

dem_barchart.png

Ok, it looks that bad now. 3% is higher than I expected, but des still looks very close to the graveyard. In Turkish, one can describe the situation with the idioms “to have one foot in the grave” or “to have eyes looking down to the soil”.

In the second task, we’ll do a more detailed count. We’ll count all possessive noun phrases and see the percentage of genitives. We count

ART NN ART ADJA? NN             der Beruf des (alten) Mannes
ART ADJA? NN NN                 des alten Mannes Beruf
ART NN PPOSAT NN                der Hund meines Bruders
ART NN APPR (ART|PPOSAT) NN     der Hund von meinem Bruder, der Weg von der Haltestelle

sequences and filter with some small tricks to distinguish the genitive ones :wink:. We begin by counting noun chunks of the fourth form. We’ll use the Matcher class from spaCy to match POS tags. Then we’ll keep only the matches that contain von, vom or vor, because APPR (ART|PPOSAT) matches a much bigger class of strings than the ones we’re interested in; see some:

einer Woche mit einem Groupon
der Kellner mit einem Grinsen
die Schmiereien in den Toiletten
der Kellner mit der Speisekarte, dem Wunsch nach der Rechnung

Note that the German POS tags of spaCy come from the Stuttgart tagset (STTS). The German tagset is richer than its English counterpart, due to the rich morphosyntactic features of the language. Text miners who work with German are acquainted with the Stuttgart tagset, which is the standard tagset for German. I will do a very rough preprocessing, then do the counting:

from spacy.matcher import Matcher
from spacy.attrs import TAG

matcher = Matcher(nlp.vocab)

# ART NN APPR (ART|PPOSAT) NN, written out as two patterns
tags = [
    [{TAG: "ART"}, {TAG: "NN"}, {TAG: "APPR"}, {TAG: "PPOSAT"}, {TAG: "NN"}],
    [{TAG: "ART"}, {TAG: "NN"}, {TAG: "APPR"}, {TAG: "ART"}, {TAG: "NN"}],
]

# add_pattern is the Matcher API of the spaCy release used here
for tag_pattern in tags:
    matcher.add_pattern("noun noun chunk", tag_pattern)

# jq leaves literal "\n" sequences in the text; replace them and squeeze whitespace
def little_preprocess(rev):
    return " ".join(rev.replace("\\n", " ").strip().split())

vors = ["vor ", "vom ", "von "]
count = 0
with codecs.open("german_reviews.txt", "r", encoding="utf-8") as f:
    for line in f:
        review = nlp(little_preprocess(line))
        matches = matcher(review)
        # the last two fields of each match are the start and end token indices
        match_strings = [review[m[-2]:m[-1]] for m in matches]
        match_strings = [match for match in match_strings if any(v in match.text for v in vors)]
        count += len(match_strings)
print(count)

The result is 451, and some example matches are:

einen Zeitungsberricht von der Eröffnung
Ein Freisitz vor dem Restaurant
Der Ausblick von der Terrasse
Die Sepia von der Tageskarte
ein Fleischgericht von der Tageskarte
der Nähe von der Bar
der Tisch von der Kellnerin
den Platz vor dem Salatbüffet
Die Pizza von meinem Mann
dem Elefanten von unserem Nachbarn (the elephant of our neighbor)?!!?

Now we count the “real” genitives. We pull the same stunt with Matcher with the corresponding POS tags. Here are some examples from the matched strings:

eine Nachbau des antiken Kolosseums
des Spaßfaktors den Besuch
Die Kreationen des Küchenchefs
Die Besonderheit des Restaurants
den Sternen des Hotels
das Thema des Hotel
Der Hummus des Tages
das Geheimnis des Ladens
Das Highlight des Abends
die Qualität des Essens
des Geschmacks des Essens
Das Konzept des Ladens, das Konzept des Restaurants, einem Drittel des Gerichtes, den Eindruck des Ambientes
die Qualität des alten Mövenpicks
die Qualität des Fisches
die Qualität des Sushis

and the number is 2474?!!?? Good Lord, either I counted the dative constructions wrong or … I should stop writing immediately :flushed: Is my theory completely wrong?! (If so, then so is Bastian Sick; sorry, Herr Sick!)
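The Matcher patterns for the genitive count aren’t listed above. As a spaCy-free illustration of the idea, the sketch below walks over hand-tagged (word, tag) pairs, looks for ART NN (ART|PPOSAT) ADJA? NN, and keeps a match only if the second determiner is an unambiguous genitive form. The function name, the determiner list and the toy phrases are my assumptions here, not the code that produced 2474; feminine and plural der is left out on purpose because it looks the same in other cases.

```python
# Determiners that are genitive (masculine/neuter singular) without ambiguity.
GENITIVE_DETERMINERS = {
    "des", "eines", "meines", "deines", "seines", "ihres", "unseres", "eures",
}


def find_genitive_chunks(tagged):
    """Return the phrases matching ART NN (ART|PPOSAT) ADJA? NN.

    tagged: list of (word, STTS tag) pairs for one sentence.
    """
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    hits = []
    for i in range(len(tagged)):
        if tags[i:i + 2] != ["ART", "NN"]:  # head noun phrase
            continue
        j = i + 2  # position of the second determiner
        if j >= len(tagged) or tags[j] not in ("ART", "PPOSAT"):
            continue
        if words[j].lower() not in GENITIVE_DETERMINERS:
            continue
        k = j + 1
        if k < len(tagged) and tags[k] == "ADJA":  # optional adjective
            k += 1
        if k < len(tagged) and tags[k] == "NN":
            hits.append(" ".join(words[i:k + 1]))
    return hits


genitive_example = [("Der", "ART"), ("Beruf", "NN"), ("des", "ART"),
                    ("alten", "ADJA"), ("Mannes", "NN"), ("ist", "VAFIN"),
                    ("Arzt", "NN")]
dative_example = [("der", "ART"), ("Hund", "NN"), ("von", "APPR"),
                  ("meinem", "PPOSAT"), ("Bruder", "NN")]
print(find_genitive_chunks(genitive_example))
print(find_genitive_chunks(dative_example))
```

Taking the two counts at face value, 2474 genitives against 451 von/vom/vor constructions means about 85% of these possessive chunks use the genitive (2474 / 2925).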

Wrong Usages with Prepositional Genitive

Ok, here we use the Matcher class as well: first we’ll match phrases of the form genitive preposition + article. Then we’ll count how many of the articles, both definite and indefinite, are in the genitive and how many are in the dative, and compare the numbers. Note that the Dative usage is WRONG here, unlike the von + dative replacements. Let’s hit it:

from spacy.matcher import Matcher
from spacy.attrs import TAG, ORTH

preplist = [
    u"jenseits",
    u"anlässlich",
    u"kraft",
    u"anstelle",
    u"laut",
    u"aufgrund",
    u"seitens",
    u"außerhalb",
    u"trotz",
    u"bezüglich",
    u"während",
    u"innerhalb",
    u"wegen",
]

matcher = Matcher(nlp.vocab)
# one pattern per preposition: the exact word form followed by any article
tags = [[{ORTH: w}, {TAG: "ART"}] for w in preplist]

for tag_pattern in tags:
    matcher.add_pattern("noun noun chunk", tag_pattern)

The rest is almost the same as the previous code for iterating over the corpus and counting. Here are some wrong usages with the dative:

während dem
wegen dem
wegen einem
trotz dem

Nahaa!!! I caught Dative with his two hands covered in blood!! :grin: Out of 1845 prepositional genitive constructions, 241 (about 13%) wrongly used the dative… looks like the real murderer is here :scream: :scream: :scream:.
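Since the counting code for this step isn’t shown, here is a rough regex-only sketch of the genitive vs. dative split, without spaCy. It is an approximation under my own assumptions: it only looks at the word right after the preposition, it counts just the unambiguous masculine/neuter articles (des, eines vs. dem, einem), and it parks der and einer in a separate bucket because the feminine genitive and dative look identical. The names classify_cases and PREP_RE are invented for the example.

```python
import re
from collections import Counter

GENITIVE_PREPOSITIONS = [
    "jenseits", "anlässlich", "kraft", "anstelle", "laut", "aufgrund",
    "seitens", "außerhalb", "trotz", "bezüglich", "während", "innerhalb",
    "wegen",
]
GENITIVE_ARTICLES = {"des", "eines"}
DATIVE_ARTICLES = {"dem", "einem"}
AMBIGUOUS_ARTICLES = {"der", "einer"}  # feminine genitive and dative coincide

# A genitive preposition as a whole word, followed by the next word.
PREP_RE = re.compile(
    r"\b(" + "|".join(GENITIVE_PREPOSITIONS) + r")\s+(\w+)",
    re.IGNORECASE,
)


def classify_cases(text):
    """Count the case of the article right after each genitive preposition."""
    counts = Counter()
    for _prep, following in PREP_RE.findall(text):
        word = following.lower()
        if word in GENITIVE_ARTICLES:
            counts["genitive"] += 1
        elif word in DATIVE_ARTICLES:
            counts["dative"] += 1
        elif word in AMBIGUOUS_ARTICLES:
            counts["ambiguous"] += 1
    return counts


sample = ("Wegen dem Regen saßen wir drinnen. Trotz des Wetters war es schön. "
          "Während der Pause gab es Kaffee.")
print(classify_cases(sample))
```

Unlike the ORTH patterns above, this sketch ignores case, so a sentence-initial Wegen is counted too; numbers from it would therefore not be directly comparable with the 1845 and 241 reported here.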

Genitive Pronouns

Earlier we saw personal, relative, interrogative and demonstrative pronouns in their genitive forms. Now we make a rough count. We’ll find the ratio of our genitive guys wessen, desselben, derer, deren, dessen to all words with the pronoun tags PDAT, PDS, PWS, PWAT, PWAV. Here are the results:

counter.most_common()[:30]

[(u'das', 12147),
 (u'was', 6632),
 (u'wie', 2995),
 (u'diese', 2764),
 (u'wer', 2628),
 (u'dieses', 2175),
 (u'dieser', 2138),
 (u'diesem', 2041),
 (u'wo', 1568),
 (u'dies', 1346),
 (u'diesen', 1053),
 (u'die', 891),
 (u'warum', 675),
 (u'der', 584),
 (u'wobei', 539),
 (u'dem', 371),
 (u'welche', 314),
 (u'denen', 221),
 (u'welches', 191),
 (u'draussen', 178),
 (u'deren', 157),
 (u'drinnen', 154),
 (u'welcher', 144),
 (u'den', 140),
 (u'weshalb', 135),
 (u'wann', 126),
 (u'dass', 125),
 (u'dessen', 122),
 (u'wem', 95),
 (u'welchen', 89)]

Our friends deren and dessen made it into the top 30, which is sort of a success. The distribution of our genitive friends is as follows:

wessen        1
desselben     1
derjenigen    1
derer         3
dessen        122
deren         157
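The counting loop for this section isn’t shown, so here is a small sketch of how such a count and ratio could be computed from (text, tag) pairs, for example built from t.text and t.tag_ of each spaCy token. The helper names and the toy tokens are hypothetical; the tag set is the one listed above.

```python
from collections import Counter

PRONOUN_TAGS = {"PDAT", "PDS", "PWS", "PWAT", "PWAV"}
GENITIVE_FORMS = {"wessen", "desselben", "derjenigen", "derer", "dessen", "deren"}


def count_pronouns(tagged_tokens):
    """Count lowercased tokens whose tag is one of the pronoun tags."""
    return Counter(
        text.lower() for text, tag in tagged_tokens if tag in PRONOUN_TAGS
    )


def genitive_share(pronoun_counts):
    """Share of genitive forms among all counted pronouns (0.0 if empty)."""
    total = sum(pronoun_counts.values())
    genitive = sum(n for w, n in pronoun_counts.items() if w in GENITIVE_FORMS)
    return genitive / total if total else 0.0


# Hand-tagged toy tokens, only to show the call pattern.
toy_tokens = [("Das", "PDS"), ("dessen", "PDS"), ("Haus", "NN"),
              ("wer", "PWS"), ("deren", "PDAT")]
toy_counts = count_pronouns(toy_tokens)
print(toy_counts)
print(genitive_share(toy_counts))
```

For reference, the six genitive forms listed above add up to 285 occurrences.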

Our genitive friends look a bit miserable compared to the popular guys:

wessen.png

The good news is they do exist; the bad news is they barely exist. However, there’s no solid murder case here; there’s no conclusive sign that Dative committed his sins here as well. Then who is to blame?

Conclusion

After all the investigation, this murder trial didn’t quite come to a full conclusion. The Dative definitely has blood on his hands; however, the Genitive doesn’t look completely dead, just seriously wounded. The best we can do is to give our best wishes to the victim and hope he continues his existence. Nevertheless, don’t worry about him too much. Life is too short to learn German anyway :relaxed: