MatchConfig
This submodule contains the class definition for MatchConfig. MatchConfig objects hold user-specified parameters for DataFrame matching (see match.py). Users can specify parameters such as which columns from a second DataFrame to include in a primary DataFrame (import_include_col) or where to write output data (output_path). These objects can then be passed to the match function from match.py or to Table.match on a Table object.
- class chromaquant.match.match_config.MatchConfig(do_export: bool = False, import_include_col: list[str] | None = None, local_filter_row: dict[str, str | bool | float | int] | None = None, match_conditions: list[dict[str, Any]] | None = None, multiple_hits_rule: Callable[[Any, DataFrame, str, float | int, bool], Series] | None = None, multiple_hits_column: str = '', output_cols_dict: dict[str, str] | None = None, output_path: str = 'match_results.csv')
Class used to define how data from two Pandas DataFrames should be matched.
Parameters
- do_export : bool, optional
True if match results should be exported to .csv, by default False.
- import_include_col : list[str] | None, optional
List of columns from the second DataFrame to include in addition to the columns from the first DataFrame, by default None.
- local_filter_row : dict[str, str | bool | float | int] | None, optional
Dictionary containing the name of the column used to filter the first DataFrame as key and the row value to filter by as value, by default None.
- match_conditions : list[dict[str, Any]] | None, optional
List of conditions by which to match the DataFrames (see Notes), by default None.
- multiple_hits_rule : Callable[[DataFrame, str], Series] | None, optional
Function that selects one Series (hit) from a DataFrame (multiple hits), with some built-in options like “SELECT_FIRST_ROW”, by default None.
- multiple_hits_column : str, optional
Name of the column to which the multiple hits rule is applied, by default ‘’.
- output_cols_dict : dict[str, str] | None, optional
Dictionary with keys set to column names as written in the matched datasets and values set to the column names desired in the output DataFrame, by default None.
- output_path : str, optional
Path to the output file, including file name and extension, by default ‘match_results.csv’.
Raises
- ValueError
If more than two strings are passed in a list for the comparison parameter when adding a match condition in add_match_condition.
Notes
The expected structure of match_conditions is as follows:
[
    {
        'condition': cq.MatchConfig.IS_EQUAL,
        'first_DF_column': str,
        'second_DF_column': str,
        'kwargs': {
            'error': float (optional),
            'or_equal': bool (optional),
            'value_function': Callable (optional)
        }
    },
    ...
]
The condition can be replaced with GREATER_THAN, LESS_THAN, or any user-defined function with the same arguments and return pattern.
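As an illustration of the structure above, the following sketch defines a user-defined condition and places it in a match_conditions entry. The function name within_relative_error, the column names, and the sample data are hypothetical, and the direct call at the end only demonstrates the argument and return pattern; it is not a statement about how the match function invokes conditions internally.

```python
import pandas as pd


def within_relative_error(value, DF, DF_column_name, error=0):
    """Hypothetical condition: keep rows whose column value lies within a
    relative error of the passed value."""
    column = DF[DF_column_name]
    mask = (column - value).abs() <= error * abs(value)
    return DF[mask]


# One entry following the documented match_conditions structure.
match_conditions = [
    {
        "condition": within_relative_error,
        "first_DF_column": "RT (min)",
        "second_DF_column": "Library RT (min)",
        "kwargs": {"error": 0.05},
    }
]

# Hypothetical second DataFrame to compare against.
library = pd.DataFrame(
    {"Compound": ["A", "B", "C"], "Library RT (min)": [4.0, 5.0, 10.0]}
)

# Call the condition by hand with a value taken from the first DataFrame.
entry = match_conditions[0]
hits = entry["condition"](
    5.2, library, entry["second_DF_column"], **entry["kwargs"]
)
print(hits)
```

Only the row for compound B is retained, since 5.0 is the only library value within 5% of 5.2.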
- static FUNCTION_OF(value: Any, DF: DataFrame, DF_column_name: str, value_function: Callable[[Any], Any], error: float | int = 0) DataFrame
Returns slice of a DataFrame where a passed value is a function of one of its column’s values.
Parameters
- value : Any
A value of any type, compared against the result of applying value_function to the column’s values.
- DF : Pandas DataFrame
A Pandas DataFrame to compare against value.
- DF_column_name : str
The name of the column in the DataFrame whose values are compared against value.
- value_function : Callable[[Any], Any]
A function that accepts a value from the DataFrame column and returns a value that should be equal to the passed value.
- error : float | int, optional
A float or integer defining the acceptable error for a float or integer value, by default 0.
Returns
- pd.DataFrame
Slice of DataFrame where some value in a given column passed through a function is equal to a passed value.
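The behavior described for FUNCTION_OF can be pictured with plain pandas. The sketch below is an illustrative stand-in, not the library's implementation: function_of_sketch is a hypothetical name, and it assumes numeric values and treats error as an absolute tolerance.

```python
import pandas as pd


def function_of_sketch(value, DF, DF_column_name, value_function, error=0):
    """Illustrative stand-in: keep rows where value_function(column value)
    equals value, within an absolute error (assumption)."""
    transformed = DF[DF_column_name].apply(value_function)
    mask = (transformed - value).abs() <= error
    return DF[mask]


# Hypothetical data: which alkane has 2n + 2 = 8 hydrogens?
df = pd.DataFrame(
    {"Carbon number": [1, 2, 3], "Name": ["methane", "ethane", "propane"]}
)
hits = function_of_sketch(8, df, "Carbon number", lambda n: 2 * n + 2)
print(hits)
```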
- static GREATER_THAN(value: Any, DF: DataFrame, DF_column_name: str, or_equal: bool = False) DataFrame
Returns slice of a DataFrame where a passed value is greater than one of its column’s values.
Parameters
- value : Any
A value of any type, checked whether greater than any rows in DF.
- DF : Pandas DataFrame
A Pandas DataFrame to compare against value.
- DF_column_name : str
The name of the column in the DataFrame whose values are compared against value.
- or_equal : bool, optional
True if value can be equal to values in DataFrame column, by default False.
Returns
- pd.DataFrame
Slice of DataFrame where values in a given column are less than a given value.
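Because the comparison is phrased from the point of view of the passed value, the direction of the inequality is easy to misread. The sketch below shows one way to read the description, using a hypothetical stand-in named greater_than_sketch rather than the library's own code: the rows kept are those whose column value is below the passed value.

```python
import pandas as pd


def greater_than_sketch(value, DF, DF_column_name, or_equal=False):
    """Illustrative stand-in: keep rows where value > column value
    (or value >= column value when or_equal is True)."""
    column = DF[DF_column_name]
    mask = column <= value if or_equal else column < value
    return DF[mask]


# Hypothetical data.
df = pd.DataFrame({"Limit": [1.0, 2.0, 3.0]})

strict = greater_than_sketch(2.0, df, "Limit")
inclusive = greater_than_sketch(2.0, df, "Limit", or_equal=True)
print(strict)
print(inclusive)
```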
- static IS_EQUAL(value: Any, DF: DataFrame, DF_column_name: str, error: float | int = 0) DataFrame
Returns slice of a DataFrame where the values in one of its columns are equal to some value.
Parameters
- value : Any
A value of any type, checked whether equal to any rows in DF.
- DF : Pandas DataFrame
A Pandas DataFrame to compare against value.
- DF_column_name : str
The name of the column in the DataFrame whose values are compared against value.
- error : float | int, optional
A float or integer defining the acceptable error for a float or integer value, by default 0.
Returns
- pd.DataFrame
Slice of DataFrame where values in a given column are equal to a given value, optionally within a given error.
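A minimal sketch of equality matching with an optional tolerance is shown below. The name is_equal_sketch is hypothetical, and the handling is an assumption: error is applied as an absolute tolerance to numeric values only, while other types are compared exactly.

```python
import pandas as pd


def is_equal_sketch(value, DF, DF_column_name, error=0):
    """Illustrative stand-in: exact equality, or equality within an
    absolute error for numeric values (assumption)."""
    column = DF[DF_column_name]
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if is_number:
        mask = (column - value).abs() <= error
    else:
        mask = column == value
    return DF[mask]


# Hypothetical data.
df = pd.DataFrame({"m/z": [43.0, 57.1, 71.2], "Name": ["a", "b", "c"]})

numeric_hits = is_equal_sketch(57.0, df, "m/z", error=0.2)
string_hits = is_equal_sketch("c", df, "Name")
print(numeric_hits)
print(string_hits)
```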
- static LESS_THAN(value: Any, DF: DataFrame, DF_column_name: str, or_equal: bool = False) DataFrame
Returns slice of a DataFrame where a passed value is less than one of its column’s values.
Parameters
- value : Any
A value of any type, checked whether less than any rows in DF.
- DF : Pandas DataFrame
A Pandas DataFrame to compare against value.
- DF_column_name : str
The name of the column in the DataFrame whose values are compared against value.
- or_equal : bool, optional
True if value can be equal to values in DataFrame column, by default False.
Returns
- pd.DataFrame
Slice of DataFrame where values in a given column are greater than a given value.
- static SELECT_FIRST_ROW(DF: DataFrame, column_name: str) Series
Multiple hits rule to select first row of DataFrame.
Parameters
- DF : pd.DataFrame
DataFrame to apply multiple hits rule to.
- column_name : str
Name of column to consider in rule.
Returns
- pd.Series
A row from the passed DF.
- static SELECT_HIGHEST_VALUE(DF: DataFrame, column_name: str) Series
Multiple hits rule to select the row of a DataFrame where the column has the highest value.
Parameters
- DF : pd.DataFrame
DataFrame to apply multiple hits rule to.
- column_name : str
Name of column to consider in rule.
Returns
- pd.Series
A row from the passed DF.
- static SELECT_LOWEST_VALUE(DF: DataFrame, column_name: str) Series
Multiple hits rule to select the row of a DataFrame where the column has the lowest value.
Parameters
- DF : pd.DataFrame
DataFrame to apply multiple hits rule to.
- column_name : str
Name of column to consider in rule.
Returns
- pd.Series
A row from the passed DF.
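The three built-in multiple hits rules correspond to common pandas row selections. The sketch below shows plain-pandas equivalents of the described behavior, plus a hypothetical custom rule with the same (DF, column_name) -> Series pattern. The data, the column names, and select_closest_to_median are assumptions for illustration, not part of the library.

```python
import pandas as pd

# Hypothetical set of multiple hits for one row of the first DataFrame.
hits = pd.DataFrame(
    {"Name": ["a", "b", "c"], "Match factor": [80, 95, 70]}
)

# Plain-pandas equivalents of the described built-in rules.
first = hits.iloc[0]                                # first row
highest = hits.loc[hits["Match factor"].idxmax()]   # highest column value
lowest = hits.loc[hits["Match factor"].idxmin()]    # lowest column value


def select_closest_to_median(DF, column_name):
    """Hypothetical custom rule: pick the row closest to the column median."""
    distance = (DF[column_name] - DF[column_name].median()).abs()
    return DF.loc[distance.idxmin()]


custom = select_closest_to_median(hits, "Match factor")
print(first["Name"], highest["Name"], lowest["Name"], custom["Name"])
```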
- add_match_condition(condition: Callable[[DataFrame, str], Series], comparison: str | list[str], kwargs: dict[str, Any] = {})
Adds a new match condition to the MatchConfig instance.
Parameters
- condition : Callable(Any, pd.DataFrame, str, float | int, bool) -> pd.Series
A condition that accepts a comparison value of any type, a DataFrame to compare the value against, the name of the column containing values to compare to the comparison value, and optional parameters for the error and whether to use inclusive inequalities (e.g., greater than or equal to), respectively.
- comparison : str or list[str]
The name of the column to compare across the two DataFrames (if the column name is the same in both) or a list of two column names to compare (if the column names differ).
- kwargs : dict[str, Any]
A dictionary of additional keyword arguments to pass to the match condition. See each match condition option for applicable keywords.
Returns
None
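The comparison argument accepts either one name or a list of two names. A minimal sketch of how such an argument could be resolved into a first and second column name is given below. The helper split_comparison is hypothetical and not part of the library. Rejecting lists of more than two strings follows the documented ValueError; rejecting lists shorter than two is an added assumption.

```python
def split_comparison(comparison):
    """Hypothetical helper: resolve comparison into
    (first_DF_column, second_DF_column)."""
    if isinstance(comparison, str):
        # Same column name in both DataFrames.
        return comparison, comparison
    names = list(comparison)
    if len(names) > 2:
        # Mirrors the documented ValueError for more than two strings.
        raise ValueError("comparison accepts at most two column names")
    if len(names) < 2:
        # Assumption: shorter lists are rejected in this sketch.
        raise ValueError("comparison list must contain two column names")
    return names[0], names[1]


same = split_comparison("RT (min)")
different = split_comparison(["RT (min)", "Library RT (min)"])
print(same)
print(different)
```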
- do_export: bool
- import_include_col: list[str]
- local_filter_row: dict[str, str | bool | float | int]
- match_conditions: list[Any]
- multiple_hits_column: str
- multiple_hits_rule: Callable[[DataFrame, str], Series]
- output_cols_dict: dict
- output_path: str