Plankton Identifier

You can create as many groups as you want. You can even define very specific groups (e.g., down to species level). If the classifier model is unable to distinguish all the groups you have defined, you can test different group associations to improve classifier performance using the appropriate tools of the Data Analysis section (see the User Guide).

The larger the learning set, the better the recognition accuracy. However, in our experience (zooplankton recognition using Zooscan and Zooprocess), 100 objects per group is the minimum needed to get a good classifier model.
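As a minimal sketch of how the 100-objects-per-group rule of thumb could be checked before training, the Python snippet below counts objects per group and lists the groups that fall short. The function name and the group names are hypothetical and are not part of Plankton Identifier.

```python
from collections import Counter

# Rule of thumb quoted above: at least 100 objects per group.
MIN_OBJECTS_PER_GROUP = 100


def undersized_groups(labels, minimum=MIN_OBJECTS_PER_GROUP):
    """Return {group: count} for every group with fewer than `minimum` objects."""
    counts = Counter(labels)
    return {group: n for group, n in counts.items() if n < minimum}


# Hypothetical learning set: one group label per object.
labels = ["copepod"] * 150 + ["chaetognath"] * 40 + ["detritus"] * 300
small = undersized_groups(labels)
print(small)  # {'chaetognath': 40}
```

Groups reported here would need more sorted objects, or could be merged with a related group.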

For Plankton Identifier versions below 1.2.6, it is because the decimal separator of your Windows configuration is different from the decimal separator used in PID files. Go to Start menu -> Parameters -> Control Panel -> Regional settings and change the decimal separator to '.', or update to version 1.2.6.

Why is the Random Forest method not working on my computer whereas other methods work?

It is because your Tanagra version is too old. Update to Tanagra 1.4.22 or above.

Why are logistic regression and Test 1 not working on my computer whereas other methods work?

It is because you are using Tanagra 1.4.22 or above with old versions of the tdm files. Update PkID to the latest version.

Why does Plankton Identifier sometimes crash when I try to analyse my own PID files?

It is probably because at least one of your PID files is partially corrupted (i.e., it does not strictly fit the expected format). To solve this problem, open your PID files with PID viewer and verify their content.

Original variables are those measured by the image analysis software that generates PID files. Although many image analysis software packages could be used with Plankton Identifier, ImageJ is currently the most commonly used. This is why the customized variables provided with Plankton Identifier are based on the original variables measured by ImageJ and Zooprocess. The list below briefly describes these variables:

NOTE: Variables shown in italics are meaningless for distinguishing between groups and MUST be deselected before starting an analysis.

Original variables from ImageJ:
Area: Surface of the object in square pixels.
Mean: Average grey value within the object; this is the sum of the grey values of all the pixels in the object divided by the number of pixels.
StdDev: Standard deviation of the grey value used to generate the mean grey value
Mode: Modal (most frequently occurring) grey value within the object
Min: Minimum grey value within the object (0 = black)
Max: Maximum grey value within the object (255 = white)
X: X position of the centre of gravity of the object (can be used in customized variables, do not use it directly as a measurement)
Y: Y position of the centre of gravity of the object (can be used in customized variables, do not use it directly as a measurement)
XM: X position of the centre of gravity of the grey level in the object (can be used in customized variables, do not use it directly as a measurement)
YM: Y position of the centre of gravity of the grey level in the object (can be used in customized variables, do not use it directly as a measurement)
Perim: The length of the outside boundary of the object
BX: X coordinate of the top left point of the smallest rectangle enclosing the object (used to extract thumbnails, not really a measurement)
BY: Y coordinate of the top left point of the smallest rectangle enclosing the object (used to extract thumbnails, not really a measurement)
Width: Width of the smallest rectangle enclosing the object (used to extract thumbnails, not really a measurement)
Height: Height of the smallest rectangle enclosing the object (used to extract thumbnails, not really a measurement)
Major: Primary axis of the best fitting ellipse to the object.
Minor: Secondary axis of the best fitting ellipse to the object.
Angle: Angle between the primary axis and a line parallel to the x-axis of the image (used to get object positioning, not really a measurement)
Circ: Circularity = (4 * Pi * Area) / Perim^2; a value of 1 indicates a perfect circle, and a value approaching 0 indicates an increasingly elongated polygon (it is the inverse of compactness).
Feret: The maximum Feret's diameter, i.e., the longest distance between any two points along the object boundary.
IntDen: Integrated density. This is the sum of the grey values of the pixels in the object (i.e. = Area*Mean)
Median: Median of the grey value used to generate the mean grey value.
Skew: The third order moment about the mean. It is the measure of lack of symmetry. The skewness for a normal distribution is zero. Negative values for the skewness indicate data that are skewed left and positive values indicate data that are skewed right.
Kurt: The fourth order moment about the mean. It is a measure of whether the data are peaked or flat relative to a normal distribution. Positive kurtosis indicates a peaked distribution and negative kurtosis indicates a flat distribution.
%area: Surface of holes, as a percentage of the object's surface.
XStart: X coordinate of the top left point of the image (used to locate object, not a measurement)
YStart: Y coordinate of the top left point of the image (used to locate object, not a measurement)
Area_exc: Surface of the object excluding holes, in square pixels (= Area * (1 - (%area / 100)))
Mean_exc: Average grey value excluding holes within the object (= IntDen /Area_exc)
Fractal: Fractal dimension of object boundary
Skelarea: Surface of the skeleton in square pixels.
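
The formulas given above for Circ, Area_exc and Mean_exc can be checked with a short Python sketch. The function names and the sample values are illustrative assumptions; ImageJ and Zooprocess compute these variables themselves.

```python
import math


def circularity(area, perim):
    """Circ = (4 * Pi * Area) / Perim^2."""
    return (4 * math.pi * area) / perim ** 2


def area_excluding_holes(area, pct_area):
    """Area_exc = Area * (1 - (%area / 100))."""
    return area * (1 - pct_area / 100)


def mean_excluding_holes(int_den, area_exc):
    """Mean_exc = IntDen / Area_exc."""
    return int_den / area_exc


# A perfect circle of radius 10 pixels should give a circularity of 1.
r = 10.0
circ = circularity(math.pi * r ** 2, 2 * math.pi * r)

# Hypothetical object: 200 square pixels, 25 % holes, mean grey value 120.
a_exc = area_excluding_holes(200.0, 25.0)    # 150.0
int_den = 200.0 * 120.0                      # IntDen = Area * Mean
m_exc = mean_excluding_holes(int_den, a_exc)  # 160.0
```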

Customized variables implemented by Plankton Identifier:
ESD = Equivalent Spherical Diameter = 2 * sqrt(Area / Pi)
Elongation = Major / Minor ('ellipse' elongation)
Range = Max - Min
MeanPos = (Max - Mean) / Range
CentroidsD = sqrt((XM - X)^2 + (YM - Y)^2); distance between the centroid (X, Y) and the grey-level centre of gravity (XM, YM)
CV = 100 * (StdDev / Mean)
SR = 100 * (StdDev / (Max - Min))
PerimAreaexc = Perim / Area_exc
FeretAreaexc = Feret / Area_exc
PerimFeret = Perim / Feret
PerimMaj = Perim / Major
Circexc = (4 * Pi * Area_exc) / Perim^2
CDexc = (CentroidsD)^2 / Area_exc
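
The customized variables listed above can be reproduced from one object's original measurements. The Python sketch below is illustrative only: it assumes that the root in ESD and CentroidsD is a square root (a diameter derived from an area requires one) and that SR uses StdDev. The function name and the sample measurements are hypothetical.

```python
import math


def customized_variables(m):
    """Compute the customized variables from a dict of original ImageJ variables."""
    grey_range = m["Max"] - m["Min"]
    # Distance between the centroid (X, Y) and the grey-level centre (XM, YM).
    centroids_d = math.hypot(m["XM"] - m["X"], m["YM"] - m["Y"])
    return {
        "ESD": 2 * math.sqrt(m["Area"] / math.pi),
        "Elongation": m["Major"] / m["Minor"],
        "Range": grey_range,
        "MeanPos": (m["Max"] - m["Mean"]) / grey_range,
        "CentroidsD": centroids_d,
        "CV": 100 * m["StdDev"] / m["Mean"],
        "SR": 100 * m["StdDev"] / grey_range,
        "PerimAreaexc": m["Perim"] / m["Area_exc"],
        "FeretAreaexc": m["Feret"] / m["Area_exc"],
        "PerimFeret": m["Perim"] / m["Feret"],
        "PerimMaj": m["Perim"] / m["Major"],
        "Circexc": (4 * math.pi * m["Area_exc"]) / m["Perim"] ** 2,
        "CDexc": centroids_d ** 2 / m["Area_exc"],
    }


# Hypothetical measurements for a single object.
example = {
    "Area": 400.0, "Mean": 100.0, "StdDev": 20.0, "Min": 50.0, "Max": 250.0,
    "X": 10.0, "Y": 10.0, "XM": 13.0, "YM": 14.0, "Perim": 80.0,
    "Major": 30.0, "Minor": 15.0, "Feret": 32.0, "Area_exc": 320.0,
}
values = customized_variables(example)
print(values["CentroidsD"], values["CV"], values["SR"])  # 5.0 20.0 10.0
```

Note that an object with a uniform grey level (Max equal to Min) would make Range zero, so MeanPos and SR are undefined for it in this sketch.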

The best method to use depends on your samples. Use a test file and the 'Test 1' method (see the User Guide) to compare the performance of the supervised learning methods all at once, then select the one that gives the best results in your case.

Use cross-validation (see the User Guide) and look at the confusion matrix. If two groups are not well distinguished but their association makes sense, try merging them into a larger group (see the User Guide) and run cross-validation again.
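
To illustrate the effect of merging two poorly distinguished groups, here is a minimal Python sketch that collapses two classes of a confusion matrix (rows = actual, columns = predicted) and recomputes the accuracy. It only re-aggregates existing counts; in Plankton Identifier you would regroup and run cross-validation again, since the model itself changes. The function names, group names and counts are hypothetical.

```python
def merge_groups(matrix, labels, a, b, merged_name):
    """Merge classes a and b of a square confusion matrix into merged_name."""
    index = {label: i for i, label in enumerate(labels)}
    new_labels = [l for l in labels if l not in (a, b)] + [merged_name]

    def members(label):
        # Original row/column indices that feed into this (possibly merged) class.
        return [index[a], index[b]] if label == merged_name else [index[label]]

    merged = [
        [sum(matrix[i][j] for i in members(row) for j in members(col))
         for col in new_labels]
        for row in new_labels
    ]
    return merged, new_labels


def accuracy(matrix):
    """Sum of the main diagonal divided by the total number of objects."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


labels = ["calanoid", "cyclopoid", "detritus"]
matrix = [
    [40, 10, 0],   # actual calanoid
    [12, 38, 0],   # actual cyclopoid
    [1, 1, 48],    # actual detritus
]
merged, new_labels = merge_groups(matrix, labels, "calanoid", "cyclopoid", "copepod")
print(new_labels)                   # ['detritus', 'copepod']
print(merged)                       # [[48, 2], [0, 100]]
print(round(accuracy(matrix), 3))   # 0.84
print(round(accuracy(merged), 3))   # 0.987
```

Re-aggregating counts this way can never lower the accuracy, so the gain has to be weighed against the loss of taxonomic detail.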

A confusion matrix is a matrix showing the actual versus predicted classifications. A confusion matrix is of size k x k, where k is the number of classes. The following confusion matrix is for k = 2 classes:

	Predicted Positive	Predicted Negative	Total
Actual Positive	TP	FN	n+
Actual Negative	FP	TN	n-
Total	TP + FP	FN + TN	N

Given a classification model (or classifier) and one instance, there are four possible outcomes:

- If the instance is positive and it is classified as positive, it is counted as a true positive (TP);
- If the instance is positive and it is classified as negative, it is counted as a false negative (FN);
- If the instance is negative and it is classified as negative, it is counted as a true negative (TN);
- If the instance is negative and it is classified as positive, it is counted as a false positive (FP).

Given a classifier and a set of instances (the learning set or the testing set) a two by two confusion matrix can be constructed.
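
The four outcomes described above can be counted directly from paired lists of actual and predicted classes. The following Python sketch is illustrative; the function name and the sample labels are assumptions.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FN, FP and TN for one class treated as 'positive'."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1   # positive classified as positive
            else:
                fn += 1   # positive classified as negative
        else:
            if p == positive:
                fp += 1   # negative classified as positive
            else:
                tn += 1   # negative classified as negative
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


actual = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
predicted = ["pos", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]
counts = confusion_counts(actual, predicted, positive="pos")
print(counts)  # {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
```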

Several common performance metrics can then be calculated from it.

Accuracy rate (1 - Err. rate): (TP + TN) / N

The rate of correct predictions made by the model over the data set (N). It corresponds to the sum of the numbers along the main diagonal, divided by N.

NOTE 1: The re-substitution accuracy rate is the accuracy of the model on the learning set. Accuracy on the learning data is NOT a good indicator of performance on future data, since it is not measured on any unseen data. One way to overcome this problem is to estimate accuracy using an independent testing set that was not used at any time during the learning process. More complex accuracy estimation methods based on re-sampling techniques, such as cross-validation, are commonly used, especially with data sets containing a small number of instances.

NOTE 2: The use of accuracy to evaluate a model assumes uniform costs of errors and uniform benefits of correct classifications.
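
As an illustration of the re-sampling idea in NOTE 1, the Python sketch below splits n instances into k folds, so that each instance is used for testing exactly once and never appears in the learning part of its own fold. This is a generic k-fold scheme under simplifying assumptions (no shuffling, no stratification by group); it is not necessarily how Plankton Identifier or Tanagra implement cross-validation, and the function name is hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (learning, testing) index lists for k folds over range(n)."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        testing = list(range(start, start + size))
        learning = list(range(0, start)) + list(range(start + size, n))
        yield learning, testing
        start += size


folds = list(k_fold_indices(10, 3))
for learning, testing in folds:
    print(len(learning), testing)
# 6 [0, 1, 2, 3]
# 7 [4, 5, 6]
# 7 [7, 8, 9]
```

In practice the instances would be shuffled first, so that each fold contains a mix of all groups.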

Error rate (= 1 - Acc. rate): (FP + FN) / N

The rate of incorrect predictions made by the model over the data set (N). It corresponds to the numbers off the main diagonal (i.e., the confusion).

True positive rate (Recall, Sensitivity): TP / n+

The rate of positives correctly classified as positive by the model (TP) over the positive instances (n+) of the data set (N).

False positive rate: FP / n-

The rate of negatives incorrectly classified as positive by the model (FP) over the negative instances (n-) of the data set (N).

Specificity (= 1 - false positive rate): TN / n-

The rate of negatives correctly classified as negative by the model (TN) over the negative instances (n-) of the data set (N).

Precision: TP / (TP + FP)
The proportion of true positives (TP) among all instances predicted as positive (TP + FP).
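
The metrics defined above can be computed from the four cells of the two-by-two confusion matrix. The Python sketch below is illustrative; the function name and the example counts are assumptions.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute the performance metrics from the four confusion matrix cells."""
    n_pos = tp + fn          # n+: actual positives
    n_neg = fp + tn          # n-: actual negatives
    n = n_pos + n_neg        # N: all instances
    return {
        "accuracy": (tp + tn) / n,
        "error_rate": (fp + fn) / n,
        "true_positive_rate": tp / n_pos,   # recall, sensitivity
        "false_positive_rate": fp / n_neg,
        "specificity": tn / n_neg,
        "precision": tp / (tp + fp),
    }


# Hypothetical counts: TP = 40, FN = 10, FP = 5, TN = 45 (N = 100).
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(name, round(value, 3))
# accuracy 0.85
# error_rate 0.15
# true_positive_rate 0.8
# false_positive_rate 0.1
# specificity 0.9
# precision 0.889
```

The sketch does not guard against empty classes: a data set with no positive or no negative instances, or a model that predicts no positives, would cause a division by zero.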