Distributional character of certain categories of words

The amount of information stored on the internet grows daily, and the requirements on the systems used to search for and analyse that information increase accordingly. As part of meeting these requirements, this study investigates whether an automated text analysis system can distinguish certain groups and categories of words in a text, and more specifically whether it can distinguish words with a high information value from words with a low information value. Such a distinction matters for optimizing systems for global surveillance and information retrieval. The study is carried out using word spaces, which are often used in text analysis to model language. The distributional character of certain categories of words is examined by studying the intrinsic dimensionality of the space locally around different words. The study of the intrinsic dimensionality indicates that the distributional character differs between categories of words, and based on this result an algorithm is implemented that classifies words from the dimensionality data. The classification algorithm is tested on different categories. The result supports the hypothesis that there may be useful differences between the distributional character of different categories of words.
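
To make the idea of local intrinsic dimensionality concrete, here is a minimal sketch in Python. The abstract does not say which estimator the authors used, so this swaps in the Levina-Bickel maximum-likelihood estimator over the k nearest neighbours as an assumption. The function name local_intrinsic_dimension, the parameter k, and the toy data (random points on a 2-D plane embedded in 50 dimensions, standing in for a real word space) are all hypothetical.

```python
import numpy as np


def local_intrinsic_dimension(vectors, index, k=10):
    """Estimate the intrinsic dimension around vectors[index].

    Assumption: uses the Levina-Bickel maximum-likelihood estimator,
    which is not necessarily the estimator used in the thesis.
    """
    # Euclidean distances from the chosen point to every point.
    diffs = vectors - vectors[index]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    # Drop the zero distance from the point to itself.
    dists = np.delete(dists, index)
    # Distances to the k nearest neighbours, in ascending order.
    nearest = np.sort(dists)[:k]
    # Log ratios of the k-th neighbour distance to each closer one.
    log_ratios = np.log(nearest[-1] / nearest[:-1])
    return (k - 1) / log_ratios.sum()


# Toy stand-in for a word space (hypothetical data, not from the thesis):
# 2000 points on a 2-D plane embedded linearly in 50 dimensions.
rng = np.random.default_rng(0)
plane = rng.normal(size=(2000, 2))
basis = rng.normal(size=(2, 50))
vectors = plane @ basis

# Average the local estimate over the first 100 points.
estimates = [local_intrinsic_dimension(vectors, i, k=20) for i in range(100)]
mean_estimate = float(np.mean(estimates))
print(mean_estimate)
```

On this toy data the averaged estimate should land near the true dimension of 2 rather than the ambient dimension of 50. In a real word space, per-word estimates of this kind could serve as features for a classifier, which is the general idea the abstract describes.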

Authors

Martin Bohman, Emelie Kullmann

University and department

KTH/School of Engineering Sciences (SCI)

Level:

Bachelor's thesis. Independent work (degree project) of at least 15 higher education credits, carried out to obtain a bachelor's degree.
