lda

class topicpy.lda.lda(learning_method='online', max_doc_update_iter=5, max_iter=5, topic_word_prior=1, doc_topic_prior=1, random_state=42, **params)
_approx_bound(X, doc_topic_distr, sub_sampling)

Estimate the variational bound.

Estimate the variational bound over “all documents” using only the documents passed in as X. Since log-likelihood of each word cannot be computed directly, we use this bound to estimate it.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
  • doc_topic_distr (ndarray of shape (n_samples, n_components)) – Document topic distribution. In the literature, this is called gamma.
  • sub_sampling (bool, default=False) – Compensate for subsampling of documents. Used when calculating the bound in online learning.
Returns:

score

Return type:

float

_check_feature_names(X, *, reset)

Set or check the feature_names_in_ attribute.

New in version 1.0.

Parameters:
  • X ({ndarray, dataframe} of shape (n_samples, n_features)) – The input samples.
  • reset (bool) –

    Whether to reset the feature_names_in_ attribute. If False, the input will be checked for consistency with feature names of data provided when reset was last True.

    Note: It is recommended to pass `reset=True` in `fit` and in the first
    call to `partial_fit`. All other methods that validate `X`
    should set `reset=False`.
    
_check_n_features(X, reset)

Set the n_features_in_ attribute, or check against it.

Parameters:
  • X ({ndarray, sparse matrix} of shape (n_samples, n_features)) – The input samples.
  • reset (bool) –

    If True, the n_features_in_ attribute is set to X.shape[1]. If False and the attribute exists, then check that it is equal to X.shape[1]. If False and the attribute does not exist, then the check is skipped.

    Note: It is recommended to pass `reset=True` in `fit` and in the first
    call to `partial_fit`. All other methods that validate `X`
    should set `reset=False`.
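
    The reset semantics described above can be sketched in plain Python. This is an illustrative stand-in, not the library's code: `FeatureCountChecker` and `check_n_features` are hypothetical names, and `n_features` is passed in directly in place of `X.shape[1]`.

```python
class FeatureCountChecker:
    """Hypothetical stand-in that mimics the documented reset semantics."""

    def check_n_features(self, n_features, reset):
        if reset:
            # fit / first partial_fit: remember the width of the input.
            self.n_features_in_ = n_features
            return
        if not hasattr(self, "n_features_in_"):
            # Nothing has been recorded yet, so the check is skipped.
            return
        if n_features != self.n_features_in_:
            raise ValueError(
                f"X has {n_features} features, "
                f"but {self.n_features_in_} were expected."
            )


checker = FeatureCountChecker()
checker.check_n_features(100, reset=True)   # as in fit
checker.check_n_features(100, reset=False)  # as in transform: passes
```

    Calling `checker.check_n_features(99, reset=False)` afterwards would raise a `ValueError`, because the width no longer matches the one recorded when reset was last True.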
    
_check_non_neg_array(X, reset_n_features, whom)

Check the format of X.

Check the format of X and make sure it contains no negative values.

Parameters:X (array-like or sparse matrix)
_check_params()

Check model parameters.

_e_step(X, cal_sstats, random_init, parallel=None)

E-step in EM update.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
  • cal_sstats (bool) – Whether to calculate sufficient statistics. Set cal_sstats to True when the M-step needs to be run.
  • random_init (bool) – Whether to initialize the document topic distribution randomly in the E-step. Set it to True in training steps.
  • parallel (joblib.Parallel, default=None) – Pre-initialized instance of joblib.Parallel.
Returns:

doc_topic_distr is the unnormalized topic distribution for each document. In the literature, this is called gamma. suff_stats holds the expected sufficient statistics for the M-step; it is None when cal_sstats is False.

Return type:

(doc_topic_distr, suff_stats)

_em_step(X, total_samples, batch_update, parallel=None)

EM update for 1 iteration.

Update _component using batch VB or online VB.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
  • total_samples (int) – Total number of documents. It is only used when batch_update is False.
  • batch_update (bool) – Parameter that controls updating method. True for batch learning, False for online learning.
  • parallel (joblib.Parallel, default=None) – Pre-initialized instance of joblib.Parallel
Returns:

doc_topic_distr – Unnormalized document topic distribution.

Return type:

ndarray of shape (n_samples, n_components)
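
To illustrate why total_samples only matters when batch_update is False, here is a minimal sketch for a single topic-word entry. It assumes the online update follows the usual online variational Bayes scheme (a weighted blend of the old value and a rescaled mini-batch estimate). The names `learning_offset`, `learning_decay`, and `n_batch_iter`, their default values, and both function names are assumptions for illustration and do not appear in this section.

```python
def batch_update(prior, suff_stat):
    # Batch VB: the new value depends only on the prior and the
    # sufficient statistic computed from the full corpus.
    return prior + suff_stat


def online_update(old, prior, suff_stat, n_batch_docs, total_samples,
                  n_batch_iter, learning_offset=10.0, learning_decay=0.7):
    # The step size shrinks as more mini-batches are seen.
    weight = (learning_offset + n_batch_iter) ** (-learning_decay)
    # Rescale the mini-batch statistic as if it covered the whole corpus;
    # this is the only place total_samples enters.
    doc_ratio = total_samples / n_batch_docs
    return (1.0 - weight) * old + weight * (prior + doc_ratio * suff_stat)


full = batch_update(prior=1.0, suff_stat=4.0)
blended = online_update(old=2.0, prior=1.0, suff_stat=4.0,
                        n_batch_docs=100, total_samples=1000, n_batch_iter=1)
```

In the batch case the previous value is discarded entirely, whereas the online case keeps a fraction `1 - weight` of it.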

classmethod _get_param_names()

Get parameter names for the estimator

_init_latent_vars(n_features)

Initialize latent variables.

_perplexity_precomp_distr(X, doc_topic_distr=None, sub_sampling=False)

Calculate approximate perplexity for data X with ability to accept precomputed doc_topic_distr

Perplexity is defined as exp(-1. * log-likelihood per word)

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
  • doc_topic_distr (ndarray of shape (n_samples, n_components), default=None) – Document topic distribution. If it is None, it will be generated by applying transform on X.
Returns:

score – Perplexity score.

Return type:

float

_repr_html_

HTML representation of estimator.

This is redundant with the logic of _repr_mimebundle_. The latter should be favored in the long term; _repr_html_ is only implemented for consumers who do not interpret _repr_mimebundle_.

_repr_html_inner()

This function is returned by the @property _repr_html_ to make hasattr(estimator, "_repr_html_") return True or False depending on get_config()["display"].

_repr_mimebundle_(**kwargs)

MIME bundle used by Jupyter kernels to display the estimator.

_unnormalized_transform(X)

Transform data X according to fitted model.

Parameters:X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
Returns:doc_topic_distr – Document topic distribution for X.
Return type:ndarray of shape (n_samples, n_components)
_validate_data(X='no_validation', y='no_validation', reset=True, validate_separately=False, **check_params)

Validate input data and set or check the n_features_in_ attribute.

Parameters:
  • X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features), default='no_validation') – The input samples. If ‘no_validation’, no validation is performed on X. This is useful for meta-estimators, which can delegate input validation to their underlying estimator(s). In that case y must be passed and the only accepted check_params are multi_output and y_numeric.
  • y (array-like of shape (n_samples,), default='no_validation') –

    The targets.

    • If None, check_array is called on X. If the estimator’s requires_y tag is True, then an error will be raised.
    • If ‘no_validation’, check_array is called on X and the estimator’s requires_y tag is ignored. This is a default placeholder and is never meant to be explicitly set. In that case X must be passed.
    • Otherwise, only y with _check_y or both X and y are checked with either check_array or check_X_y depending on validate_separately.
  • reset (bool, default=True) –

    Whether to reset the n_features_in_ attribute. If False, the input will be checked for consistency with data provided when reset was last True.

    Note: It is recommended to pass `reset=True` in `fit` and in the first
    call to `partial_fit`. All other methods that validate `X`
    should set `reset=False`.
    
  • validate_separately (False or tuple of dicts, default=False) – Only used if y is not None. If False, call check_X_y(). Otherwise, it must be a tuple of kwargs to be used for calling check_array() on X and y respectively.
  • **check_params (kwargs) – Parameters passed to sklearn.utils.check_array() or sklearn.utils.check_X_y(). Ignored if validate_separately is not False.
Returns:

out – The validated input. A tuple is returned if both X and y are validated.

Return type:

{ndarray, sparse matrix} or tuple of these

fit(X, y=None)

Learn model for the data X with variational Bayes method.

When learning_method is ‘online’, use mini-batch update. Otherwise, use batch update.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
  • y (Ignored) – Not used, present here for API consistency by convention.
Returns:

Fitted estimator.

Return type:

self

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.
  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
  • **fit_params (dict) – Additional fit parameters.
Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

full_analysis(directory, xl, tl=None, label='primary_site', logarithmise=False, round_data=False, *args, **kwargs) → None
Parameters:
  • df
  • directory
  • xl
  • tl
  • kwargs – arguments passed to LatentDirichletAllocation().fit_transform
get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:dict
partial_fit(X, y=None)

Online VB with Mini-Batch update.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
  • y (Ignored) – Not used, present here for API consistency by convention.
Returns:

Partially fitted estimator.

Return type:

self

perplexity(X, sub_sampling=False)

Calculate approximate perplexity for data X.

Perplexity is defined as exp(-1. * log-likelihood per word)

Changed in version 0.19: The doc_topic_distr argument has been deprecated and is ignored because the user no longer has access to the unnormalized distribution.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
  • sub_sampling (bool) – Whether to do sub-sampling.
Returns:

score – Perplexity score.

Return type:

float
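
As a worked example of the definition above, the following sketch converts a total log-likelihood bound into a perplexity value. The helper name and the sample numbers are made up for illustration; in practice the bound is the approximate value the model estimates.

```python
import math


def perplexity_from_bound(log_likelihood_bound, word_count):
    # Perplexity = exp(-1 * log-likelihood per word).
    per_word = log_likelihood_bound / word_count
    return math.exp(-per_word)


# A corpus of 1000 word tokens with a bound of -1000 * ln(50)
# corresponds to a perplexity of about 50.
value = perplexity_from_bound(-1000 * math.log(50), 1000)
```

Lower perplexity corresponds to a higher per-word log-likelihood, so smaller values indicate a better fit on the data passed in.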

score(X, y=None)

Calculate approximate log-likelihood as score.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
  • y (Ignored) – Not used, present here for API consistency by convention.
Returns:

score – Use approximate bound as score.

Return type:

float

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:**params (dict) – Estimator parameters.
Returns:self – Estimator instance.
Return type:estimator instance
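
The `<component>__<parameter>` convention can be illustrated with a small, library-independent sketch that splits parameter names at the first double underscore. `split_nested_params` is a hypothetical helper, not part of this class, and the parameter names in the usage line are examples only.

```python
def split_nested_params(params):
    """Separate top-level parameters from per-component ones."""
    top_level, nested = {}, {}
    for key, value in params.items():
        name, delim, sub_key = key.partition("__")
        if delim:
            # "<component>__<parameter>": route to the named component.
            nested.setdefault(name, {})[sub_key] = value
        else:
            top_level[key] = value
    return top_level, nested


top, nested = split_nested_params({"max_iter": 10, "lda__random_state": 0})
```

Here `max_iter` would be set on the estimator itself, while `random_state` would be forwarded to the component named `lda`.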
transform(X)

Transform data X according to the fitted model.

Changed in version 0.18: doc_topic_distr is now normalized

Parameters:X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
Returns:doc_topic_distr – Document topic distribution for X.
Return type:ndarray of shape (n_samples, n_components)
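
Normalization here means that each document's row sums to 1. A minimal sketch with plain lists, assuming the unnormalized distribution is a matrix of positive values with one row per document (`normalize_rows` is a hypothetical helper):

```python
def normalize_rows(doc_topic_distr):
    normalized = []
    for row in doc_topic_distr:
        total = sum(row)
        # Divide every entry by its row total so the row sums to 1.
        normalized.append([value / total for value in row])
    return normalized


rows = normalize_rows([[2.0, 6.0], [1.0, 1.0]])
```

Each resulting row can then be read as the share of each topic in that document.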