Core

The core contains the bootstrap code shared by the summarization components. It provides:

  • A common standard structure for documents and summaries to ensure interoperability between different components.
  • Utilities for loading document sets into the common structure.
  • Common utilities for document sets, documents and sentences, such as sentence splitting and tokenization.

Sentence class

class clstk.sentence.Sentence(sentenceText)

Bases: object

Class to represent a single sentence

__init__(sentenceText)

Set sentence text and translated text

Parameters: sentenceText – sentence text

setText(sentenceText)

Set text for the sentence

Parameters: sentenceText – sentence text

getText()

Get sentence text

Returns: sentence text

setTranslation(translation)

Set translated text

Parameters: translation – translated text

getTranslation()

Get translated text

The translated text defaults to sentence text

Returns: translated text

setVector(vector)

Set sentence vector

Parameters: vector – sentence vector

getVector()

Get sentence vector

Returns: sentence vector

setTranslationVector(vector)

Set sentence vector for translated text

Parameters: vector – sentence vector

getTranslationVector()

Get sentence vector for translated text

Returns: sentence vector

setExtra(key, value)

Set extra key-value pair

Parameters:
  • key – key for the stored value
  • value – value to store

getExtra(key, default=None)

Get extra value from key

Parameters:
  • key – key for the stored value
  • default – default value if key not found

charCount()

Get character count for translated text

Returns: Number of characters in translated text

tokenCount()

Get token count for translated text

Returns: Number of tokens in translated text
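
The accessor semantics above, in particular the translation defaulting to the sentence text and the counts being taken over the translated text, can be illustrated with a small stand-in class. This is a sketch that mirrors part of the documented interface, not the clstk implementation: the class name `SentenceSketch`, the whitespace tokenization in `tokenCount` and the `"position"` extra key are assumptions made for illustration.

```python
class SentenceSketch:
    """Stand-in mirroring part of the documented Sentence interface (not clstk code)."""

    def __init__(self, sentenceText):
        # The constructor sets both the text and the translated text,
        # so the translation defaults to the original sentence.
        self._text = sentenceText
        self._translation = sentenceText
        self._extra = {}

    def getText(self):
        return self._text

    def setTranslation(self, translation):
        self._translation = translation

    def getTranslation(self):
        return self._translation

    def setExtra(self, key, value):
        self._extra[key] = value

    def getExtra(self, key, default=None):
        return self._extra.get(key, default)

    def charCount(self):
        # Counted over the translated text, as documented above.
        return len(self._translation)

    def tokenCount(self):
        # Assumption: whitespace tokenization; the real tokenizer may differ.
        return len(self._translation.split())


sentence = SentenceSketch("Das ist ein Test.")
untranslated = sentence.getTranslation()  # equals the source text at this point
sentence.setTranslation("This is a test.")
sentence.setExtra("position", 0)          # hypothetical extra key
```

Because both counts read the translated text, they change once `setTranslation` is called, while `getText` keeps returning the source sentence.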

SentenceCollection class

class clstk.sentenceCollection.SentenceCollection

Bases: object

Class to store a collection of sentences.

Also provides several common operations on the collection.

__init__()

Initialize the collection

setSourceLang(lang)

Set source language for the collection

Parameters: lang – two-letter code for source language

setTargetLang(lang)

Set target language for the collection

Parameters: lang – two-letter code for target language

addSentence(sentence)

Add a sentence to the collection

Parameters: sentence – sentence to be added

addSentences(sentences)

Add sentences to the collection

Parameters: sentences – list of sentences to be added

getSentences()

Get list of sentences in the collection

Returns: list of sentences

getSentenceVectors()

Get list of sentence vectors for sentences in the collection

Returns: np.array containing sentence vectors

getTranslationSentenceVectors()

Get list of sentence vectors for translations of sentences in the collection

Returns: np.array containing sentence vectors

generateSentenceVectors()

Generate sentence vectors

generateTranslationSentenceVectors()

Generate sentence vectors for translations

translate(sourceLang, targetLang, replaceOriginal=False)

Translate sentences

Parameters:
  • sourceLang – two-letter code for source language
  • targetLang – two-letter code for target language
  • replaceOriginal – Replace source text with translation if True. Used for early-translation

simplify(sourceLang, replaceOriginal=False)

Simplify sentences

Parameters:
  • sourceLang – two-letter code for language
  • replaceOriginal – Replace source sentences with simplified sentences. Used for early-simplify.
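
The effect of `replaceOriginal` in `translate` can be sketched with stand-in code. The following is not the clstk implementation: `translate_text` is a hypothetical placeholder for a real translation backend, `translate_sentences` is a hypothetical free function, and sentences are modelled as plain dicts rather than `Sentence` objects. It only illustrates the difference between keeping the source text next to its translation and overwriting it (early translation).

```python
def translate_text(text, sourceLang, targetLang):
    # Hypothetical placeholder: a real system would call a translation service.
    lookup = {("de", "en", "Guten Morgen."): "Good morning."}
    return lookup.get((sourceLang, targetLang, text), text)


def translate_sentences(sentences, sourceLang, targetLang, replaceOriginal=False):
    """Translate each sentence dict in place and return the list."""
    for sentence in sentences:
        translated = translate_text(sentence["text"], sourceLang, targetLang)
        sentence["translation"] = translated
        if replaceOriginal:
            # Early translation: later stages only see target-language text.
            sentence["text"] = translated
    return sentences


# Default: the source text is kept and the translation is stored beside it.
late = translate_sentences([{"text": "Guten Morgen."}], "de", "en")

# replaceOriginal=True: the source text is overwritten by the translation.
early = translate_sentences(
    [{"text": "Guten Morgen."}], "de", "en", replaceOriginal=True
)
```

The same pattern would apply to `simplify`, with a simplification step in place of the translation step.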

Corpus class

class clstk.corpus.Corpus(dirname)

Bases: clstk.sentenceCollection.SentenceCollection

Class for source documents. Contains utilities for loading a document set.

__init__(dirname)

Initialize the class

Parameters: dirname – Directory from which source documents are loaded

load(params, translate=False, replaceWithTranslation=False, simplify=False, replaceWithSimplified=False)

Load source document set

Parameters:
  • params – dict containing different params including sourceLang and targetLang.
  • translate – Whether to translate sentences to target language
  • replaceWithTranslation – Whether to replace source sentences with translation
  • simplify – Whether to simplify sentences
  • replaceWithSimplified – Whether to replace source sentences with simplified sentences
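
A call to `load` might look like the following sketch. It assumes the `clstk` package is importable and that `sourceLang` and `targetLang` are sufficient keys for `params`; the documentation above only says the dict contains different params including those two, so other keys may be required. The helper names `build_params` and `load_corpus`, the directory path and the language codes are placeholders introduced for illustration.

```python
def build_params(sourceLang, targetLang):
    # Only the two keys named in the documentation; others may be needed.
    return {"sourceLang": sourceLang, "targetLang": targetLang}


def load_corpus(dirname, params, early_translation=False):
    # Imported inside the function so that defining it does not require clstk.
    from clstk.corpus import Corpus

    corpus = Corpus(dirname)
    corpus.load(
        params,
        translate=True,
        # With early translation, source sentences are replaced by translations.
        replaceWithTranslation=early_translation,
    )
    return corpus


params = build_params("en", "hi")
# corpus = load_corpus("path/to/documents", params)  # placeholder path
```

Leaving `replaceWithTranslation` at False keeps the source sentences and stores translations alongside them; setting it to True corresponds to the early-translation mode described for `SentenceCollection.translate`.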

Summary class

class clstk.summary.Summary

Bases: clstk.sentenceCollection.SentenceCollection

charCount()

Get total number of characters in all the sentences

tokenCount()

Get total number of tokens in all the sentences

getSummary()

Get printable summary generated from source text

getTargetSummary()

Get printable summary generated from translated text
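
One way the aggregate counts might be used is to enforce a length budget while a summary is assembled. The class below is a stand-in and not the clstk implementation: `SummarySketch` and `build_summary` are hypothetical names, sentences are plain strings, tokens are whitespace-separated, and joining sentences with newlines in `getSummary` is an assumption about the printable format.

```python
class SummarySketch:
    """Stand-in mirroring part of the documented Summary interface (not clstk code)."""

    def __init__(self):
        self._sentences = []

    def addSentence(self, sentence):
        self._sentences.append(sentence)

    def charCount(self):
        # Total characters across all sentences.
        return sum(len(s) for s in self._sentences)

    def tokenCount(self):
        # Assumption: whitespace tokenization.
        return sum(len(s.split()) for s in self._sentences)

    def getSummary(self):
        # Assumption: one sentence per line.
        return "\n".join(self._sentences)


def build_summary(ranked_sentences, max_chars):
    """Add sentences in ranked order, skipping any that would exceed the budget."""
    summary = SummarySketch()
    for sentence in ranked_sentences:
        if summary.charCount() + len(sentence) <= max_chars:
            summary.addSentence(sentence)
    return summary


ranked = ["Rain is expected.", "Winds will be strong in the north.", "Stay safe."]
summary = build_summary(ranked, max_chars=30)
```

With a 30-character budget, the second sentence is skipped because it does not fit, and the shorter third sentence is still added.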