spinfo / tec

Code for Text Engineering courses, University of Cologne

Home Page: http://www.spinfo.phil-fak.uni-koeln.de/spinfo-lehre.html


Information Retrieval

Course plan and material (in German)

Text Mining

Course plan and material (in German)

| Session | Functional | Technical | Uses | Literature |
|---|---|---|---|---|
| tm1 | Corpus and data access | OOD and TDD basics; object DB and native queries | DB4O; Crawler (ir6) | Gamma et al. (1994), Ch. 1; Bloch (2008), Item 16 |
| tm2 | Data enrichment with standoff annotation | Generics; XML binding for export and import; schema generation as a form of MDD (code-first) | Index (ir2); TF-IDF (ir5); JAXB (or Java 6) | Thompson & McKelvie (1997); Bloch (2008), Ch. 5; Naftalin & Wadler (2006), Part 1 |
| tm3 | Text classification with Naive Bayes | Delegation and strategy for modular classification | Crawler (ir6) | Gamma et al. (1994), p. 315; Bloch (2008), Item 21 |
| tm4 | Comparative text classification and evaluation | Using the Weka API; adapter for integration | Weka (developer version) | Gamma et al. (1994), p. 139; Witten & Frank (2005) |
| tm5 | Flat k-means clustering and purity evaluation | Java Concurrency API (CopyOnWriteArrayList, ExecutorService); visualization with Graphviz DOT | TF-IDF vectors and cosine similarity (ir5) | Bloch (2008), Item 68 |
| tm6 | Release engineering | CRISP builds with Ant | All previous code | Clark (2006), Ch. 2 |
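
Session tm5 relies on cosine similarity between TF-IDF vectors and evaluates the resulting clusters by purity. The following is only a minimal sketch of those two measures, not the code from this repository: the class name `ClusterEval`, the sparse `Map<String, Double>` vector representation, and the convention that each cluster is given as a list of gold-standard labels are all assumptions made for illustration. It also uses `Map.of` and `List.of`, so it assumes Java 9 or later.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helpers; all names here are hypothetical, not taken from the repository. */
final class ClusterEval {

    /** Cosine similarity of two sparse vectors (term -> weight, e.g. TF-IDF). */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w; // only shared terms contribute
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // Assumed convention: similarity with an all-zero vector is 0.
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    /**
     * Purity: for every cluster take the size of its most frequent gold label,
     * sum these sizes, and divide by the total number of clustered items.
     * Each inner list holds the gold labels of the items in one cluster.
     */
    static double purity(List<List<String>> clusters) {
        int total = 0;
        int majoritySum = 0;
        for (List<String> cluster : clusters) {
            Map<String, Integer> counts = new HashMap<>();
            int best = 0;
            for (String label : cluster) {
                int count = counts.merge(label, 1, Integer::sum);
                best = Math.max(best, count);
            }
            majoritySum += best;
            total += cluster.size();
        }
        return total == 0 ? 0.0 : (double) majoritySum / total;
    }

    public static void main(String[] args) {
        Map<String, Double> d1 = Map.of("text", 1.0, "mining", 2.0);
        Map<String, Double> d2 = Map.of("text", 2.0, "mining", 4.0);
        System.out.println("cosine = " + cosine(d1, d2));

        List<List<String>> clusters = List.of(
                List.of("A", "A", "B"),
                List.of("B", "B", "B", "A"));
        System.out.println("purity = " + purity(clusters));
    }
}
```

In the example, the two vectors point in the same direction, so their similarity is 1 up to floating-point rounding, and the clustering has a purity of 5/7 because 2 of 3 and 3 of 4 items carry the majority label of their cluster.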

Instructions

  • For each session, a file that can be run both as a Java application and as a JUnit test is located at de.uni_koeln.phil_fak.iv.tm.pX.PraxisX.java (X is the session number)
  • To run all tests, run All.java as a JUnit test. The tests need the corpora in data/; to generate them, first run All.java as a Java application
  • The Ant script can compile and deploy the code as an executable Jar (ant deploy), generate Javadoc (ant doc), and run the tests (ant test), whose results are summarized in an HTML report (ant report)

Literature

  • Bloch, Joshua (2008), Effective Java, Second Edition, Addison-Wesley.
  • Clark, Mike (2006), Projekt-Automatisierung, Hanser.
  • Gamma, Erich, Richard Helm, Ralph Johnson and John Vlissides (1995), Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley.
  • Naftalin, Maurice and Philip Wadler (2006), Java Generics and Collections, O’Reilly.
  • Thompson, H. S. and D. McKelvie (1997), Hyperlink semantics for standoff markup of read-only documents. In Proceedings of SGML Europe ’97: The next decade – Pushing the Envelope, pp. 227–229.
  • Witten, Ian H. and Eibe Frank (2005), Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann.

Languages

Java 98.6%, Graphviz (DOT) 1.4%