data-analysis – Make Me Engineer

Why does one hot encoding improve machine learning performance?

May 8, 2023 by Tarik

Many learning algorithms either learn a single weight per feature or use distances between samples. The former is the case for linear models such as logistic regression, which are easy to explain. Suppose you have a dataset with a single categorical feature, “nationality”, with values “UK”, “French” and “US”. Assume, without loss of …
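The excerpt is cut off, so the following is only a minimal sketch of the setup it describes, not the author's own example. It assumes pandas is available; the toy rows and variable names are made up for illustration. It contrasts integer coding, which imposes an arbitrary order on the categories, with one-hot encoding, which gives a single-weight-per-feature model one independent weight per category.

```python
import pandas as pd

# Hypothetical toy data: a single categorical feature, as in the excerpt.
df = pd.DataFrame({"nationality": ["UK", "French", "US", "UK"]})

# Integer coding: categories are sorted alphabetically, so French=0, UK=1, US=2.
# A linear model with one weight for this column is forced to treat
# "US" as twice "UK", which has no real meaning.
integer_coded = df["nationality"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, so each category
# can receive its own weight.
one_hot = pd.get_dummies(df["nationality"], dtype=int)

print(integer_coded.tolist())
print(one_hot)
```

Each row of one_hot has exactly one 1, so no ordering or spacing between categories is implied.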

How to merge multiple dataframes

October 12, 2022 by Tarik

Below is the cleanest, most comprehensible way of merging multiple dataframes when complex queries aren’t involved. Simply merge with DATE as the index, using the OUTER method (to get all the data).
    import pandas as pd
    from functools import reduce
    df1 = pd.read_table('file1.csv', sep=',')
    df2 = pd.read_table('file2.csv', sep=',')
    df3 = pd.read_table('file3.csv', sep=',')
Now, …
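The excerpt stops before the merge step itself, so what follows is a sketch of one common way to do it rather than the post's exact code. It assumes every frame has a DATE column; the in-memory frames, their values, and the column names a, b, c are hypothetical stand-ins for the CSV files.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the three CSV files; each shares a DATE column.
df1 = pd.DataFrame({"DATE": ["2022-01-01", "2022-01-02"], "a": [1, 2]})
df2 = pd.DataFrame({"DATE": ["2022-01-02", "2022-01-03"], "b": [3, 4]})
df3 = pd.DataFrame({"DATE": ["2022-01-01", "2022-01-03"], "c": [5, 6]})

frames = [df1, df2, df3]

# Fold the list pairwise: ((df1 merge df2) merge df3).
# how="outer" keeps every DATE that appears in any frame.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="DATE", how="outer"),
    frames,
)
merged = merged.sort_values("DATE").reset_index(drop=True)

print(merged)
```

Dates missing from a given frame show up as NaN in that frame's columns, which is the price of keeping all the data.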

Fitting polynomial model to data in R

July 18, 2022 by Tarik

To get a third-order polynomial in x (x^3), you can do lm(y ~ x + I(x^2) + I(x^3)) or lm(y ~ poly(x, 3, raw=TRUE)). You could fit a 10th-order polynomial and get a near-perfect fit, but should you? EDIT: poly(x, 3) is probably a better choice (as suggested by @hadley), because it uses orthogonal polynomials, so the terms are not correlated with each other.
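For readers following the pandas posts on this page, the same raw cubic fit can be sketched in Python. This uses NumPy's polyfit as a stand-in for R's lm, which is an assumption of this sketch and not part of the original answer. The data are made up and noise-free so the recovered coefficients are easy to check.

```python
import numpy as np

# Hypothetical noise-free data generated from a known cubic.
x = np.linspace(-2, 2, 21)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.25 * x**3

# Least-squares fit of a degree-3 polynomial in raw powers of x.
# Coefficients are returned highest power first: [x^3, x^2, x, constant].
coeffs = np.polyfit(x, y, deg=3)

# Evaluate the fitted polynomial at the original x values.
fitted = np.polyval(coeffs, x)

print(coeffs)
```

Raising deg will always shrink the residuals on the training points, which is exactly why a very high-order fit deserves suspicion.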

How do I sum values in a column that match a given condition using pandas?

June 26, 2022 by Tarik

The essential idea here is to select the data you want to sum, and then sum it. This selection can be done in several different ways, a few of which are shown below. Boolean indexing: arguably the most common way to select the values is to use Boolean indexing. With this method, you …
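A minimal sketch of the select-then-sum idea using Boolean indexing. The frame, its column names a and b, and the condition a == 1 are all hypothetical, chosen only to make the result easy to verify.

```python
import pandas as pd

# Hypothetical data: sum column "b" over the rows where column "a" equals 1.
df = pd.DataFrame({"a": [1, 2, 1, 3], "b": [10, 20, 30, 40]})

# The comparison yields a Boolean Series, one True/False per row.
mask = df["a"] == 1

# .loc selects the rows where mask is True and the column "b"; .sum() adds them.
total = df.loc[mask, "b"].sum()

print(total)
```

Here the rows with a == 1 hold the b values 10 and 30, so total is 40.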

Peak signal detection in realtime timeseries data

May 21, 2022 by Tarik

Robust peak detection algorithm (using z-scores): I came up with an algorithm that works very well for these types of datasets. It is based on the principle of dispersion: if a new datapoint is a given number x of standard deviations (its z-score) away from some moving mean, the algorithm signals. The algorithm is …
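The excerpt ends before the algorithm is fully described, so the following is a simplified sketch of the dispersion principle only, not the author's complete method. The function name zscore_signals, the parameters lag and threshold, and the sample data are assumptions made for illustration.

```python
import numpy as np


def zscore_signals(y, lag=5, threshold=3.0):
    """Flag points more than `threshold` standard deviations away from
    the mean of the previous `lag` points (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    signals = np.zeros(len(y), dtype=int)
    for i in range(lag, len(y)):
        window = y[i - lag:i]  # the `lag` points before y[i]
        mean, std = window.mean(), window.std()
        # Signal +1 for a peak above the moving mean, -1 for one below it.
        if std > 0 and abs(y[i] - mean) > threshold * std:
            signals[i] = 1 if y[i] > mean else -1
    return signals


# Made-up series: flat around 1.0 with a single spike at index 7.
data = [1, 1.1, 0.9, 1, 1.05, 0.95, 1, 5, 1, 1.02]
signals = zscore_signals(data, lag=5, threshold=3.0)

print(signals.tolist())
```

In this simplified version a detected spike still enters the following windows and inflates their standard deviation, so nearby smaller peaks can be missed; a more robust variant would limit how much a signalled point influences the moving statistics.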

Python: pandas merge multiple dataframes

May 9, 2022 by Tarik

How to sort a dataFrame in python pandas by two or more columns?

April 30, 2022 by Tarik

As of the 0.17.0 release, the sort method was deprecated in favor of sort_values, and sort was removed completely in the 0.20.0 release. The arguments (and results) remain the same:
    df.sort_values(['a', 'b'], ascending=[True, False])
In versions before 0.17.0, you can use the ascending argument of sort:
    df.sort(['a', 'b'], ascending=[True, False])
For example:
    In [11]: df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b']) …
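A small deterministic sketch of sort_values with mixed sort directions. The frame below is hypothetical and replaces the random data from the excerpt so the result can be checked by eye.

```python
import pandas as pd

# Hypothetical fixed data in place of the random integers from the excerpt.
df = pd.DataFrame({"a": [2, 1, 2, 1], "b": [1, 3, 4, 2]})

# Sort by "a" ascending, then break ties by "b" descending.
result = df.sort_values(["a", "b"], ascending=[True, False])

print(result)
```

The original row labels are kept, so the sorted frame has index order 1, 3, 2, 0; call reset_index(drop=True) if a fresh 0..n-1 index is wanted.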