Commit c2a5916

Implemented pca_by_eigen()
1 parent e574261 commit c2a5916

11 files changed, +407 -63 lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -37,7 +37,7 @@ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
3737
## <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrame.html" target="_blank"><B>DataFrame documentation with code samples</B></a>
3838
This is a C++ analytical library designed for data analysis similar to libraries in Python and R. For example, you would compare this to [Pandas](https://pandas.pydata.org) or [R data.frame](https://www.w3schools.com/r/r_data_frames.asp)<BR>
3939
You can slice the data in many different ways. You can join, merge, group-by the data. You can run various statistical, summarization, financial, and ML algorithms on the data. You can add your custom algorithms easily. You can multi-column sort, custom pick and delete the data. And more …<BR>
40-
DataFrame also includes a large collection of analytical algorithms in form of visitors. These are from basic stats such as <I>Mean</I>, <I>Std Deviation</I>, <I>Return</I>, … to more involved analysis such as <I>Affinity Propagation</I>, <I>Polynomial Fit</I>, <I>Fast Fourier transform of arbitrary length</I> … including a good collection of trading indicators. You can also easily add your own algorithms.<BR>
40+
DataFrame also includes a large collection of analytical algorithms in form of visitors. These are from basic stats such as <I>Mean</I>, <I>Std Deviation</I>, <I>Return</I>, … to more involved analysis such as <I>PCA</I>, <I>Polynomial Fit</I>, <I>Fast Fourier transform of arbitrary length</I> … including a good collection of trading indicators. You can also easily add your own algorithms.<BR>
4141
DataFrame also employs extensive multithreading in almost all its APIs, for large datasets. That makes DataFrame especially suitable for analyzing large datasets.<BR>
4242
For basic operations to start you off, see [Hello World](examples/hello_world.cc). For a complete list of features with code samples, see <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/DataFrame.html" target="_blank">documentation</a>.
4343

docs/HTML/DataFrame.html

Lines changed: 4 additions & 0 deletions
@@ -325,6 +325,10 @@ <H2 ID="2"><font color="blue">API Reference with code samples <font size="+4">&#
325325
<td title="True, if matches an statistical pattern"><a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/pattern_spec.html">pattern_match</a>()</td>
326326
</tr>
327327

328+
<tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
329+
<td title="Calculates Principal Component Analysis (PCA)."><a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/pca_by_eigen.html">pca_by_eigen</a>()</td>
330+
</tr>
331+
328332
<tr class="item" onmouseover="this.style.backgroundColor='#ffff66';" onmouseout="this.style.backgroundColor='#d4e3e5';">
329333
<td title="Returns a mask vector of peaks"><a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/peaks.html">peaks</a>()</td>
330334
</tr>

docs/HTML/covariance_matrix.html

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@
4949
<PRE><B>
5050
template&lt;typename T&gt;
5151
<a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/Matrix.html">Matrix</a>&lt;T, <a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/Matrix.html">matrix_orient</a>::column_major&gt;
52-
covariance_matrix(std::vector<const char *> &&col_names,
52+
covariance_matrix(std::vector<const char *> &amp;&amp;col_names,
5353
<a href="https://htmlpreview.github.io/?https://github.com/hosseinmoein/DataFrame/blob/master/docs/HTML/NormalizeVisitor.html">normalization_type</a> norm_type =
5454
normalization_type::none) const;
5555
</B></PRE></font>

docs/HTML/pca_by_eigen.html

Lines changed: 182 additions & 0 deletions
Large diffs are not rendered by default.

include/DataFrame/DataFrame.h

Lines changed: 27 additions & 28 deletions
@@ -3739,7 +3739,7 @@ class DataFrame : public ThreadGranularity {
37393739
// Name of the column
37403740
//
37413741
template<typename T, typename C = std::less<T>>
3742-
size_type
3742+
[[nodiscard]] size_type
37433743
inversion_count(const char *col_name) const;
37443744

37453745
// This calculates and returns the variance/covariance matrix of the
@@ -3748,44 +3748,43 @@ class DataFrame : public ThreadGranularity {
37483748
// T:
37493749
// Type of the named columns
37503750
// col_names:
3751-
// Vector of column names
3751+
// Vector of column names
37523752
// norm_type:
37533753
// The method to normalize the columns first before calculations.
37543754
// Default is not normalizing
37553755
//
37563756
template<typename T>
3757-
Matrix<T, matrix_orient::column_major>
3757+
[[nodiscard]] Matrix<T, matrix_orient::column_major>
37583758
covariance_matrix(
37593759
std::vector<const char *> &&col_names,
37603760
normalization_type norm_type = normalization_type::none) const;
37613761

3762-
3763-
3764-
3765-
3766-
3767-
3768-
3769-
3770-
3771-
3772-
// Principal Component Analysis (PCA)
3762+
// This uses Eigenspace evaluation to calculate Principal Component
3763+
// Analysis (PCA).
3764+
// It returns a matrix whose columns are the reduced dimensions with most
3765+
// significant information.
3766+
// PCA is a dimensionality reduction method that is often used to reduce
3767+
// the dimensionality of large data sets, by transforming a large set of
3768+
// variables into a smaller one that still contains most of the information
3769+
// in the large set.
3770+
// Reducing the number of variables of a data set naturally comes at the
3771+
// expense of accuracy, but the trick in dimensionality reduction is to
3772+
// trade a little accuracy for simplicity. Smaller data sets are easier
3773+
// to explore and visualize, and machine learning algorithms can
3774+
// analyze their data points much faster without extraneous variables
3775+
// to process.
3776+
//
3777+
// T:
3778+
// Type of the named columns
3779+
// col_names:
3780+
// Vector of column names
3781+
// params:
3782+
// Parameters necessary for this operation
37733783
//
37743784
template<typename T>
3775-
EigenSpace<T>
3776-
prin_comp_analysis(std::vector<const char *> &&col_names,
3777-
const PCAParams params = { }) const;
3778-
3779-
3780-
3781-
3782-
3783-
3784-
3785-
3786-
3787-
3788-
3785+
[[nodiscard]] Matrix<T, matrix_orient::column_major>
3786+
pca_by_eigen(std::vector<const char *> &&col_names,
3787+
const PCAParams params = { }) const;
37893788

37903789
// This function returns a DataFrame indexed by std::string that provides
37913790
// a few statistics about the columns of the calling DataFrame.
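
The header comments above describe `covariance_matrix()` with an optional normalization step, and `PCAParams` defaults to `normalization_type::z_score`. As a rough, library-independent sketch of what "covariance matrix of z-scored columns" means, the following uses plain `std::vector` instead of the library's `Matrix` type. The names `Column`, `z_score`, and `covariance` are hypothetical, and the use of the sample (n - 1) denominator is an assumption, not something this diff confirms about the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Column = std::vector<double>;   // hypothetical stand-in for a DataFrame column

// Z-score a column in place: subtract the mean, then divide by the
// sample standard deviation (n - 1 denominator; an assumption here).
void z_score(Column &col) {
    assert(col.size() > 1);
    const double n = static_cast<double>(col.size());

    double mean = 0.0;
    for (double v : col) mean += v;
    mean /= n;

    double sum_sq = 0.0;
    for (double v : col) sum_sq += (v - mean) * (v - mean);
    const double sd = std::sqrt(sum_sq / (n - 1.0));
    assert(sd > 0.0);

    for (double &v : col) v = (v - mean) / sd;
}

// Sample covariance matrix of equally sized columns.
// Entry (i, j) is the covariance between column i and column j.
std::vector<Column> covariance(const std::vector<Column> &cols) {
    const std::size_t k = cols.size();
    assert(k > 0);
    const std::size_t n = cols[0].size();
    assert(n > 1);

    std::vector<double> means(k, 0.0);
    for (std::size_t i = 0; i < k; ++i) {
        assert(cols[i].size() == n);
        for (double v : cols[i]) means[i] += v;
        means[i] /= static_cast<double>(n);
    }

    std::vector<Column> cov(k, Column(k, 0.0));
    for (std::size_t i = 0; i < k; ++i) {
        for (std::size_t j = i; j < k; ++j) {
            double sum = 0.0;
            for (std::size_t r = 0; r < n; ++r)
                sum += (cols[i][r] - means[i]) * (cols[j][r] - means[j]);
            // The matrix is symmetric, so fill both halves at once.
            cov[i][j] = cov[j][i] = sum / static_cast<double>(n - 1);
        }
    }
    return cov;
}
```

Under these assumptions the result for z-scored columns is the correlation matrix: the diagonal is 1 and each off-diagonal entry is the Pearson correlation of the two columns.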

include/DataFrame/DataFrameTypes.h

Lines changed: 5 additions & 11 deletions
@@ -717,24 +717,18 @@ struct StationaryTestParams {
717717

718718
// ----------------------------------------------------------------------------
719719

720-
enum class pca_method : unsigned char {
721-
722-
eigen = 1, // Eigen decomposition of the covariance matrix
723-
svd = 2, // Singular Value Decomposition of the data matrix
724-
};
725-
726720
struct PCAParams {
727721

728-
pca_method method { pca_method::eigen };
729722
normalization_type norm_type { normalization_type::z_score };
730723

731-
// if populated, number of eigen components kept.
724+
// If populated (set above zero), number of top eigen values to keep.
732725
//
733-
std::size_t num_comp_kept { 0 };
726+
long num_comp_to_keep { 0 };
734727

735-
// if populated, percentage of eigen components kept -- 0.9 means 90%.
728+
// If populated (num_comp_to_keep is 0), percentage of eigen values to keep.
729+
// 0.9 means 90%.
736730
//
737-
double pct_comp_kept { 0.9 };
731+
double pct_comp_to_keep { 0.9 };
738732
};
739733

740734
// ----------------------------------------------------------------------------
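
The two `PCAParams` fields interact as follows in the implementation shown further down in this commit: a positive `num_comp_to_keep` wins outright; otherwise components are added, largest magnitude first, until their share of the total absolute eigenvalue sum reaches `pct_comp_to_keep`. A minimal standalone sketch of that rule is below. `KeepRule` and `components_to_keep` are hypothetical names, and the eigenvalues are assumed to be already sorted in ascending order of magnitude.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the two PCAParams fields that matter here.
struct KeepRule {
    long   num_comp_to_keep { 0 };
    double pct_comp_to_keep { 0.9 };
};

// Returns how many components to keep.
// `evals` is assumed sorted ascending by absolute value, so the most
// significant eigenvalue is the last element.
long components_to_keep(const std::vector<double> &evals,
                        const KeepRule &rule) {
    // An explicit count takes priority over the percentage.
    if (rule.num_comp_to_keep > 0)
        return rule.num_comp_to_keep;

    double total = 0.0;
    for (double v : evals) total += std::fabs(v);

    double kept = 0.0;
    long   count = 0;

    // Walk from the largest magnitude down until the kept share of the
    // total reaches the requested percentage.
    for (std::size_t i = evals.size(); i-- > 0; ) {
        kept += std::fabs(evals[i]);
        ++count;
        if (kept / total >= rule.pct_comp_to_keep)
            break;
    }
    return count;
}
```

For eigenvalues `{0.01, 0.04, 0.15, 3.8}` the largest one alone carries 95% of the total, so the default 90% rule keeps a single component. That matches the shape of the test in this commit, where four highly correlated price columns collapse to one.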

include/DataFrame/Internals/DataFrame_get.tcc

Lines changed: 99 additions & 0 deletions
@@ -979,6 +979,105 @@ covariance_matrix(std::vector<const char *> &&col_names,
979979
return (data_mat.covariance());
980980
}
981981

982+
// ----------------------------------------------------------------------------
983+
984+
template<typename I, typename H>
985+
template<typename T>
986+
Matrix<T, matrix_orient::column_major> DataFrame<I, H>::
987+
pca_by_eigen(std::vector<const char *> &&col_names,
988+
const PCAParams params) const {
989+
990+
#ifdef HMDF_SANITY_EXCEPTIONS
991+
if (params.num_comp_to_keep == 0 && params.pct_comp_to_keep < 0.01)
992+
throw NotFeasible("pca_by_eigen(): Parameters don't make sense");
993+
if (params.num_comp_to_keep > long(col_names.size()))
994+
throw NotFeasible("pca_by_eigen(): num_comp_to_keep > #input columns");
995+
#endif // HMDF_SANITY_EXCEPTIONS
996+
997+
// Get the covariance matrix of normalized data
998+
//
999+
const auto var_cov =
1000+
covariance_matrix<T>(
1001+
std::forward<std::vector<const char *>>(col_names),
1002+
params.norm_type);
1003+
1004+
// Calculate Eigen space
1005+
//
1006+
Matrix<T, matrix_orient::row_major> eigenvals;
1007+
Matrix<T, matrix_orient::column_major> eigenvecs;
1008+
1009+
var_cov.eigen_space(eigenvals, eigenvecs, true);
1010+
1011+
// Keep the most significant columns
1012+
//
1013+
Matrix<T, matrix_orient::column_major> mod_evecs { };
1014+
long col_count { 0 };
1015+
1016+
if (params.num_comp_to_keep > 0) {
1017+
col_count = params.num_comp_to_keep;
1018+
}
1019+
else {
1020+
T ev_sum { 0 };
1021+
1022+
for (long c = 0; c < eigenvals.cols(); ++c)
1023+
ev_sum += std::fabs(eigenvals(0, c));
1024+
1025+
T kept_sum { 0 };
1026+
1027+
for (long c = eigenvals.cols() - 1; c >= 0; --c) {
1028+
kept_sum += std::fabs(eigenvals(0, c));
1029+
col_count += 1;
1030+
if ((kept_sum / ev_sum) >= params.pct_comp_to_keep)
1031+
break;
1032+
}
1033+
}
1034+
mod_evecs.resize(eigenvecs.rows(), col_count);
1035+
for (long c = 0; c < col_count; ++c) {
1036+
const long col = eigenvecs.cols() - c - 1;
1037+
1038+
for (long r = 0; r < eigenvecs.rows(); ++r)
1039+
mod_evecs(r, c) = eigenvecs(r, col);
1040+
}
1041+
1042+
// Copy the data matrix
1043+
//
1044+
const size_type col_num = col_names.size();
1045+
size_type min_col_s { indices_.size() };
1046+
std::vector<const ColumnVecType<T> *> columns(col_num, nullptr);
1047+
SpinGuard guard { lock_ };
1048+
1049+
for (size_type i { 0 }; i < col_num; ++i) {
1050+
columns[i] = &get_column<T>(col_names[i], false);
1051+
if (columns[i]->size() < min_col_s)
1052+
min_col_s = columns[i]->size();
1053+
}
1054+
guard.release();
1055+
1056+
Matrix<T, matrix_orient::column_major> data_mat {
1057+
long(min_col_s), long(col_num) };
1058+
auto lbd =
1059+
[&data_mat, &columns = std::as_const(columns)]
1060+
(auto begin, auto end) -> void {
1061+
for (auto i { begin }; i < end; ++i)
1062+
data_mat.set_column(columns[i]->begin(), long(i));
1063+
};
1064+
const auto thread_level =
1065+
(min_col_s >= ThreadPool::MUL_THR_THHOLD || col_num >= 20 )
1066+
? get_thread_level() : 0L;
1067+
1068+
if (thread_level > 2) {
1069+
auto futures =
1070+
thr_pool_.parallel_loop(size_type(0), col_num, std::move(lbd));
1071+
1072+
for (auto &fut : futures) fut.get();
1073+
}
1074+
else lbd(size_type(0), col_num);
1075+
1076+
// Return PCA
1077+
//
1078+
return (data_mat * mod_evecs);
1079+
}
1080+
9821081
} // namespace hmdf
9831082

9841083
// ----------------------------------------------------------------------------
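
The last two steps of `pca_by_eigen()` above are: copy the trailing `col_count` eigenvector columns in reverse order (so the most significant one lands in column 0), then multiply the data matrix by that reduced basis. As far as this diff shows, the data matrix is built from the raw columns, while the covariance matrix is computed with `params.norm_type`. The sketch below illustrates only the column selection and the product, using nested `std::vector` rather than the library's `Matrix`. `Mat`, `top_components`, and `project` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;   // row-major: m[row][col]

// Copy the last `k` columns of `evecs` in reverse order, so that the
// eigenvector with the largest-magnitude eigenvalue becomes column 0.
// Assumes the columns of `evecs` are sorted ascending by significance.
Mat top_components(const Mat &evecs, std::size_t k) {
    const std::size_t rows = evecs.size();
    const std::size_t cols = rows ? evecs[0].size() : 0;
    assert(k <= cols);

    Mat out(rows, std::vector<double>(k, 0.0));
    for (std::size_t c = 0; c < k; ++c) {
        const std::size_t src = cols - c - 1;
        for (std::size_t r = 0; r < rows; ++r)
            out[r][c] = evecs[r][src];
    }
    return out;
}

// Plain matrix product: (n x d) * (d x k) -> (n x k).
Mat project(const Mat &data, const Mat &basis) {
    const std::size_t n = data.size();
    const std::size_t d = basis.size();
    const std::size_t k = d ? basis[0].size() : 0;

    Mat out(n, std::vector<double>(k, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        assert(data[i].size() == d);
        for (std::size_t j = 0; j < k; ++j)
            for (std::size_t t = 0; t < d; ++t)
                out[i][j] += data[i][t] * basis[t][j];
    }
    return out;
}
```

Calling `project(data, top_components(evecs, k))` yields one row per observation and `k` columns, which is the shape the test in this commit checks with `pca_mat.rows()` and `pca_mat.cols()`.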

include/DataFrame/Utils/Matrix.h

Lines changed: 0 additions & 9 deletions
@@ -1487,15 +1487,6 @@ operator * (const Matrix<T, MO1> &lhs, const Matrix<T, MO2> &rhs) {
14871487
return (result);
14881488
}
14891489

1490-
// ----------------------------------------------------------------------------
1491-
1492-
template<typename T>
1493-
struct EigenSpace {
1494-
1495-
Matrix<T, matrix_orient::row_major> eigen_vals { };
1496-
Matrix<T, matrix_orient::column_major> eigen_vecs { };
1497-
};
1498-
14991490
} // namespace hmdf
15001491

15011492
// ----------------------------------------------------------------------------

include/DataFrame/Utils/Matrix.tcc

Lines changed: 3 additions & 1 deletion
@@ -1190,11 +1190,13 @@ eigen_space(MA1 &eigenvalues, MA2 &eigenvectors, bool sort_values) const {
11901190
if (sort_values) {
11911191
for (size_type c = 0; c < cols() - 1; ++c) {
11921192
size_type min_col { c };
1193+
value_type abs_min_val { std::fabs(tmp_evals(0, c)) };
11931194
value_type min_val { tmp_evals(0, c) };
11941195

11951196
for (size_type cc = c + 1; cc < cols(); ++cc)
1196-
if (tmp_evals(0, cc) < min_val) {
1197+
if (std::fabs(tmp_evals(0, cc)) < abs_min_val) {
11971198
min_col = cc;
1199+
abs_min_val = std::fabs(tmp_evals(0, cc));
11981200
min_val = tmp_evals(0, cc);
11991201
}
12001202

test/dataframe_tester_4.cc

Lines changed: 72 additions & 0 deletions
@@ -2269,6 +2269,77 @@ static void test_covariance_matrix() {
22692269

22702270
// ----------------------------------------------------------------------------
22712271

2272+
static void test_pca_by_eigen() {
2273+
2274+
std::cout << "\nTesting pca_by_eigen( ) ..." << std::endl;
2275+
2276+
StrDataFrame df;
2277+
2278+
try {
2279+
df.read("IBM.csv", io_format::csv2);
2280+
}
2281+
catch (const DataFrameError &ex) {
2282+
std::cout << ex.what() << std::endl;
2283+
}
2284+
2285+
const auto pca_mat = df.pca_by_eigen<double>(
2286+
{ "IBM_Close", "IBM_Open", "IBM_High", "IBM_Low" });
2287+
2288+
// Dimensions were reduced to 1 containing at least 90% of the information.
2289+
// This makes sense, since these 4 columns are highly correlated.
2290+
//
2291+
assert(pca_mat.cols() == 1);
2292+
assert(pca_mat.rows() == 5031);
2293+
assert(std::fabs(pca_mat(0, 0) - 197.063) < 0.001);
2294+
assert(std::fabs(pca_mat(1, 0) - 200.875) < 0.001);
2295+
assert(std::fabs(pca_mat(491, 0) - 149.02) < 0.01);
2296+
assert(std::fabs(pca_mat(1348, 0) - 166.44) < 0.01);
2297+
assert(std::fabs(pca_mat(2677, 0) - 333.405) < 0.001);
2298+
assert(std::fabs(pca_mat(5029, 0) - 216.175) < 0.001);
2299+
assert(std::fabs(pca_mat(5030, 0) - 219.555) < 0.001);
2300+
2301+
const auto pca_mat2 = df.pca_by_eigen<double>(
2302+
{ "IBM_Close", "IBM_Open", "IBM_High", "IBM_Low" },
2303+
{ .num_comp_to_keep = 3 });
2304+
2305+
// 3 most significant dimensions are kept.
2306+
// As you can see the first column is unchanged and clearly contains
2307+
// almost all of the information.
2308+
//
2309+
assert(pca_mat2.cols() == 3);
2310+
assert(pca_mat2.rows() == 5031);
2311+
2312+
assert(std::fabs(pca_mat2(0, 0) - 197.063) < 0.001);
2313+
assert(std::fabs(pca_mat2(0, 1) - -0.0951913) < 0.001);
2314+
assert(std::fabs(pca_mat2(0, 2) - 1.85473) < 0.001);
2315+
2316+
assert(std::fabs(pca_mat2(1, 0) - 200.875) < 0.001);
2317+
assert(std::fabs(pca_mat2(1, 1) - -2.08604) < 0.001);
2318+
assert(std::fabs(pca_mat2(1, 2) - 2.68895) < 0.001);
2319+
2320+
assert(std::fabs(pca_mat2(491, 0) - 149.02) < 0.01);
2321+
assert(std::fabs(pca_mat2(491, 1) - -1.34957) < 0.01);
2322+
assert(std::fabs(pca_mat2(491, 2) - 2.09026) < 0.01);
2323+
2324+
assert(std::fabs(pca_mat2(1348, 0) - 166.44) < 0.01);
2325+
assert(std::fabs(pca_mat2(1348, 1) - 0.0354559) < 0.01);
2326+
assert(std::fabs(pca_mat2(1348, 2) - 0.41972) < 0.01);
2327+
2328+
assert(std::fabs(pca_mat2(2677, 0) - 333.405) < 0.001);
2329+
assert(std::fabs(pca_mat2(2677, 1) - -1.33686) < 0.001);
2330+
assert(std::fabs(pca_mat2(2677, 2) - 2.13684) < 0.001);
2331+
2332+
assert(std::fabs(pca_mat2(5029, 0) - 216.175) < 0.001);
2333+
assert(std::fabs(pca_mat2(5029, 1) - -1.18141) < 0.001);
2334+
assert(std::fabs(pca_mat2(5029, 2) - 2.18029) < 0.001);
2335+
2336+
assert(std::fabs(pca_mat2(5030, 0) - 219.555) < 0.001);
2337+
assert(std::fabs(pca_mat2(5030, 1) - -2.66858) < 0.001);
2338+
assert(std::fabs(pca_mat2(5030, 2) - 2.85412) < 0.001);
2339+
}
2340+
2341+
// ----------------------------------------------------------------------------
2342+
22722343
int main(int, char *[]) {
22732344

22742345
MyDataFrame::set_optimum_thread_level();
@@ -2310,6 +2381,7 @@ int main(int, char *[]) {
23102381
test_make_stationary();
23112382
test_StationaryCheckVisitor();
23122383
test_covariance_matrix();
2384+
test_pca_by_eigen();
23132385

23142386
return (0);
23152387
}
