lda¶
-
class
topicpy.lda.
lda
(learning_method='online', max_doc_update_iter=5, max_iter=5, topic_word_prior=1, doc_topic_prior=1, random_state=42, **params)[source]¶
-
_approx_bound
(X, doc_topic_distr, sub_sampling)¶ Estimate the variational bound.
Estimate the variational bound over “all documents” using only the documents passed in as X. Since log-likelihood of each word cannot be computed directly, we use this bound to estimate it.
Parameters: - X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
- doc_topic_distr (ndarray of shape (n_samples, n_components)) – Document topic distribution. In the literature, this is called gamma.
- sub_sampling (bool, default=False) – Compensate for subsampling of documents. It is used when calculating the bound in online learning.
Returns: score
Return type: float
-
_check_n_features
(X, reset)¶ Set the n_features_in_ attribute, or check against it.
Parameters: - X ({ndarray, sparse matrix} of shape (n_samples, n_features)) – The input samples.
- reset (bool) –
If True, the n_features_in_ attribute is set to X.shape[1]. If False and the attribute exists, then check that it is equal to X.shape[1]. If False and the attribute does not exist, then the check is skipped.
Note: It is recommended to pass reset=True in `fit` and in the first call to `partial_fit`. All other methods that validate `X` should set `reset=False`.
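The three cases for reset can be sketched as follows. This is a minimal illustration, not the library's code: the class FeatureCountChecker and its method are hypothetical stand-ins for an estimator, and only the column count of X is passed in.

```python
class FeatureCountChecker:
    """Hypothetical stand-in for an estimator; not part of topicpy or scikit-learn."""

    def check_n_features(self, n_columns, reset):
        if reset:
            # fit / first partial_fit: remember the width of the training data
            self.n_features_in_ = n_columns
            return
        if not hasattr(self, "n_features_in_"):
            # nothing has been recorded yet, so the check is skipped
            return
        if n_columns != self.n_features_in_:
            raise ValueError(
                f"X has {n_columns} features, but {self.n_features_in_} were expected."
            )


checker = FeatureCountChecker()
checker.check_n_features(3, reset=False)  # skipped: attribute does not exist yet
checker.check_n_features(3, reset=True)   # records n_features_in_ = 3
checker.check_n_features(3, reset=False)  # passes: same width as recorded
```

A later call with a different width and reset=False raises ValueError in this sketch.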
-
_check_non_neg_array
(X, reset_n_features, whom)¶ Check X format.
Check the format of X and make sure it contains no negative values.
Parameters: X (array-like or sparse matrix) –
-
_check_params
()¶ Check model parameters.
-
_e_step
(X, cal_sstats, random_init, parallel=None)¶ E-step in EM update.
Parameters: - X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
- cal_sstats (bool) – Whether to calculate the sufficient statistics. Set cal_sstats to True when the M-step needs to be run.
- random_init (bool) – Whether to initialize the document topic distribution randomly in the E-step. Set it to True in training steps.
- parallel (joblib.Parallel, default=None) – Pre-initialized instance of joblib.Parallel.
Returns: doc_topic_distr is the unnormalized topic distribution for each document. In the literature, this is called gamma. suff_stats is the expected sufficient statistics for the M-step. When cal_sstats is False, it will be None.
Return type: (doc_topic_distr, suff_stats)
-
_em_step
(X, total_samples, batch_update, parallel=None)¶ EM update for 1 iteration.
Update _component using batch VB or online VB.
Parameters: - X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
- total_samples (int) – Total number of documents. It is only used when batch_update is False.
- batch_update (bool) – Parameter that controls updating method. True for batch learning, False for online learning.
- parallel (joblib.Parallel, default=None) – Pre-initialized instance of joblib.Parallel
Returns: doc_topic_distr – Unnormalized document topic distribution.
Return type: ndarray of shape (n_samples, n_components)
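The role of total_samples in the online case can be illustrated with a scalar sketch of the two update rules. It assumes this class inherits the update from scikit-learn's LatentDirichletAllocation, whose learning_offset and learning_decay settings are borrowed here. The function em_update and its arguments are hypothetical names, and a single topic-word entry stands in for the whole matrix.

```python
def em_update(component, suff_stat, prior, batch_update,
              total_samples=None, batch_size=None,
              n_batch_iter=1, learning_offset=10.0, learning_decay=0.7):
    """Update one topic-word entry (scalar sketch of the matrix update)."""
    if batch_update:
        # batch VB: the new value depends only on the current sufficient statistics
        return prior + suff_stat
    # online VB: blend the old value with an estimate rescaled to the full corpus
    weight = (learning_offset + n_batch_iter) ** (-learning_decay)
    doc_ratio = total_samples / batch_size
    return (1 - weight) * component + weight * (prior + doc_ratio * suff_stat)


batch_value = em_update(5.0, 2.0, prior=1.0, batch_update=True)
online_value = em_update(5.0, 2.0, prior=1.0, batch_update=False,
                         total_samples=100, batch_size=10)
```

In the batch branch total_samples never appears, which matches the note that it is only used when batch_update is False.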
-
classmethod
_get_param_names
()¶ Get parameter names for the estimator
-
_init_latent_vars
(n_features)¶ Initialize latent variables.
-
_perplexity_precomp_distr
(X, doc_topic_distr=None, sub_sampling=False)¶ Calculate approximate perplexity for data X, optionally accepting a precomputed doc_topic_distr.
Perplexity is defined as exp(-1. * log-likelihood per word)
Parameters: - X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
- doc_topic_distr (ndarray of shape (n_samples, n_components), default=None) – Document topic distribution. If it is None, it will be generated by applying transform on X.
Returns: score – Perplexity score.
Return type: float
-
_repr_html_
¶ HTML representation of estimator.
This is redundant with the logic of _repr_mimebundle_. The latter should be favored in the long term; _repr_html_ is only implemented for consumers who do not interpret _repr_mimebundle_.
-
_repr_html_inner
()¶ This function is returned by the @property _repr_html_ to make hasattr(estimator, "_repr_html_") return True or False depending on get_config()["display"].
-
_repr_mimebundle_
(**kwargs)¶ MIME bundle used by Jupyter kernels to display the estimator.
-
_unnormalized_transform
(X)¶ Transform data X according to fitted model.
Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
Returns: doc_topic_distr – Document topic distribution for X.
Return type: ndarray of shape (n_samples, n_components)
-
_validate_data
(X, y='no_validation', reset=True, validate_separately=False, **check_params)¶ Validate input data and set or check the n_features_in_ attribute.
Parameters: - X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) – The input samples.
- y (array-like of shape (n_samples,), default='no_validation') –
The targets.
- If None, check_array is called on X. If the estimator’s requires_y tag is True, then an error will be raised.
- If ‘no_validation’, check_array is called on X and the estimator’s requires_y tag is ignored. This is a default placeholder and is never meant to be explicitly set.
- Otherwise, both X and y are checked with either check_array or check_X_y depending on validate_separately.
- reset (bool, default=True) –
Whether to reset the n_features_in_ attribute. If False, the input will be checked for consistency with data provided when reset was last True.
Note: It is recommended to pass reset=True in `fit` and in the first call to `partial_fit`. All other methods that validate `X` should set `reset=False`.
- validate_separately (False or tuple of dicts, default=False) – Only used if y is not None. If False, call validate_X_y(). Else, it must be a tuple of kwargs to be used for calling check_array() on X and y respectively.
- **check_params (kwargs) – Parameters passed to sklearn.utils.check_array() or sklearn.utils.check_X_y(). Ignored if validate_separately is not False.
Returns: out – The validated input. A tuple is returned if y is not None.
Return type: {ndarray, sparse matrix} or tuple of these
-
fit
(X, y=None)¶ Learn model for the data X with variational Bayes method.
When learning_method is ‘online’, use mini-batch update. Otherwise, use batch update.
Parameters: - X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
- y (Ignored) –
Returns: Return type: self
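A minimal usage sketch follows. It assumes that this class behaves like scikit-learn's LatentDirichletAllocation with the defaults shown in the class signature, so it calls the scikit-learn estimator directly instead of topicpy. The count matrix and n_components=2 are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-word count matrix: 6 documents, 5 vocabulary terms (made-up data).
X = np.array([
    [3, 2, 0, 0, 1],
    [4, 1, 0, 1, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 3, 4, 1],
    [0, 1, 2, 5, 0],
    [1, 0, 4, 3, 2],
])

# Same settings as the defaults listed in the class signature above.
model = LatentDirichletAllocation(
    n_components=2,
    learning_method="online",
    max_doc_update_iter=5,
    max_iter=5,
    topic_word_prior=1.0,
    doc_topic_prior=1.0,
    random_state=42,
)
fitted = model.fit(X)            # fit returns the estimator itself
doc_topics = model.transform(X)  # shape (n_samples, n_components)
```

Because fit returns self, calls can be chained, for example model.fit(X).transform(X).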
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: - X (array-like of shape (n_samples, n_features)) – Input samples.
- y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
- **fit_params (dict) – Additional fit parameters.
Returns: X_new – Transformed array.
Return type: ndarray of shape (n_samples, n_features_new)
-
full_analysis
(directory, xl, tl=None, label='primary_site', logarithmise=False, round_data=False, *args, **kwargs) → None[source]¶ Parameters: - df –
- directory –
- xl –
- tl –
- kwargs – arguments to LatentDirichletAllocation().fit_transform
-
get_params
(deep=True)¶ Get parameters for this estimator.
Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict
-
partial_fit
(X, y=None)¶ Online VB with Mini-Batch update.
Parameters: - X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
- y (Ignored) –
Returns: Return type: self
-
perplexity
(X, sub_sampling=False)¶ Calculate approximate perplexity for data X.
Perplexity is defined as exp(-1. * log-likelihood per word)
Changed in version 0.19: The doc_topic_distr argument has been deprecated and is ignored because the user no longer has access to the unnormalized distribution.
Parameters: - X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
- sub_sampling (bool) – Do sub-sampling or not.
Returns: score – Perplexity score.
Return type: float
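The definition above can be worked through with a small sketch. The helper perplexity_from_bound is a hypothetical name, the approximate bound stands in for the log-likelihood, and the numbers are invented for illustration.

```python
import math


def perplexity_from_bound(bound, word_count):
    """exp(-1 * log-likelihood per word), with `bound` standing in for the log-likelihood."""
    return math.exp(-1.0 * bound / word_count)


# Hypothetical numbers: a bound of -1200 over 400 word occurrences,
# i.e. -3 per word, which gives a perplexity of exp(3).
ppl = perplexity_from_bound(-1200.0, 400)
```

A higher (less negative) bound per word gives a lower perplexity, so lower values indicate a better fit to X.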
-
score
(X, y=None)¶ Calculate approximate log-likelihood as score.
Parameters: - X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
- y (Ignored) –
Returns: score – Use approximate bound as score.
Return type: float
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance
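The <component>__<parameter> form can be illustrated by showing how such a key splits at the first double underscore. The helper split_param_key and the key names are hypothetical; this is a sketch of the naming convention, not the estimator's own code.

```python
def split_param_key(key):
    """Split 'component__parameter' at the first double underscore."""
    name, delim, sub_key = key.partition("__")
    # Without a delimiter the key addresses the estimator itself.
    return (name, sub_key) if delim else (name, None)


simple = split_param_key("max_iter")             # ('max_iter', None)
nested = split_param_key("lda__max_iter")        # ('lda', 'max_iter')
deeper = split_param_key("pipe__lda__max_iter")  # ('pipe', 'lda__max_iter')
```

The remainder after the first delimiter is handed on to the named component, which is how deeper nesting is resolved one level at a time.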
-
transform
(X)¶ Transform data X according to the fitted model.
Changed in version 0.18: doc_topic_distr is now normalized
Parameters: X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Document word matrix.
Returns: doc_topic_distr – Document topic distribution for X.
Return type: ndarray of shape (n_samples, n_components)
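The normalization mentioned in the version note can be sketched as dividing each row of the unnormalized distribution (gamma) by its sum. The helper normalize_rows and the sample values are hypothetical, and plain lists are used in place of an ndarray.

```python
def normalize_rows(gamma):
    """Scale each document's unnormalized topic weights so they sum to 1."""
    normalized = []
    for row in gamma:
        total = sum(row)
        normalized.append([value / total for value in row])
    return normalized


# Made-up unnormalized weights for two documents and two topics.
doc_topic_distr = normalize_rows([[2.0, 6.0], [1.0, 1.0]])
```

After this step each row can be read as a probability distribution over topics for one document.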
-