# Sequence shape detection algorithms comparison [Python]

Motivation

Goal of this post is to compare some sequence shape comparison criteria. This should be considered as a naive introduction to the topic.

Theoretical statements
• Cross-correlation is used to recognize a noisy signal by comparing it with its noiseless version. If a signal is symmetrical, cross-correlation is mathematically equivalent to the convolution.
• Cross-covariance corresponds to the cross-correlation of two zero-mean sequences. If two sequences have 0 mean, then their covariance is same as their correlation.
• Two stochastic processes are called uncorrelated if their covariance is zero for all times
• Cosine similarity is a metric used to measure the distance between two vectors without consindering their magnitude. The similarity is here measured on their orientation. It is used e.g. to match a document to a query whereby the frequency of the query occuring within the document should not be affected by the document’s length (since longer documents could easily increase the query frequency)
• Dynamit Time Warping (DTW) algorithm measures the similarity of two temporal sequences, which vary differently in the time domain. For example, two voices speaking at different speed may be compared using DTW. The Python implementation here is an approximation, called fast DTW.
• Pearson coefficient is a measure (from -1 to +1) of the linear correlation between two variables. A value of 0 means no linear correlation, +1 positive linear correlation, – and −1 negative linear correlation. It corresponds to the covariance of two variable divided by the product of their standard deviation. In the following implementation I modified the result by finally taking its absolute value.
• Spearman coefficient (also from -1 to +1) tells how well a monotonic function can be used to describe the relationship between two variables. It corresponds to the Pearson coefficient between ranks of two variables. A ranked variable is a variable having values which can be put in order. For example, the sequence 2.3, 4.1, 1.2, 5 can be sorted as 1.2(=1st), 2.3(=2nd), 4.1(=3rd), 5(=4th) and therefore the equivaled ranked sequence would be 2, 3, 1, 4. In the following implementation I modified the result by finally taking its absolute value.
• Form and crest factor are called waveform factors and are related to their shapes. The latter describes how extreme the peaks are in a waveform. Both factors are mixed here as components of a vector for each sequence and the resulting cosine similarity is measured.
• Applying the Softmax function to a vector will map each component in the interval from 0 to 1 in such a way to make the components sum equal to 1. These values can be interpreted as probabilities.

Code

You can see the jupyter notebook here or scroll down to the Embedded Jupyter Notebook section.

Observations

Assumptions:

• A set of classes is given and must be sufficient to identify the given sequence
• Softmax applied
• Classes have very different characteristics
• There is no negative ranking before application of Softmax
• No costant value as template

According to the given assumptions and samples, the fast_dtw_criterion gave a good human-liked hard-decision whereby the global shape is taken into account. Correlation_criterion_v1 and correlation_criterion_v2 also are also hard-decision oriented but look meaningful only where they agree with each other. It applies also in the frequency domain.