Fastdup documentation page

The main function is fastdup.run. It runs on a folder or list of images and computes the artifacts needed to compute image similarity, outliers, components, etc.

fastdup.run

Run fastdup tool for finding duplicate, near duplicates, outliers and clusters of related images in a corpus of images.
The only mandatory argument is input_dir.

Parameters:

Name Type Description Default
input_dir str

Location of the images/videos to analyze.
* A folder
* A remote folder (s3 or minio starting with s3:// or minio://). When using minio append the minio server name for example minio://google/visual_db/sku110k.
* A file containing absolute filenames each on its own row.
* A file containing s3 full paths or minio paths each on its own row.
* A python list with absolute filenames.
* A python list with absolute folders; all images and videos in those folders are added recursively.
* For run_mode=2, a folder containing fastdup binary features or a file containing list of atrain_feature.dat.csv files in multiple folders
* yolov5 yaml input file containing train and test folders (single folder supported for now)
* We support jpg, jpeg, tiff, tif, giff, heif, heic, bmp, png, mp4, avi. In addition we support tar, tar.gz, tgz and zip files containing images.
If you have other image extensions that are readable by opencv imread() you can give them in a file (each image on its own row) and then we do not check for the
known extensions and use opencv to read those formats.
Note: It is not possible to mix compressed (videos or tars/zips) and regular images. Use the flag turi_param='tar_only=1' if you want to ignore images and run from compressed files.
Note: We assume image sizes should be larger or equal to 10x10 pixels. Smaller images (either on width or on height) will be ignored with a warning shown.
Note: It is possible to skip small images also by defining minimum allowed file size using turi_param='min_file_size=1000' (in bytes).
Note: For performance reasons it is always preferred to copy s3 images from s3 to local disk and then run fastdup on local disk, since copying images from s3 in a loop is very slow.
Alternatively you can use the flag turi_param='sync_s3_to_local=1' to copy ahead all images on the remote s3 bucket to disk.

''
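One of the input forms listed above is a text file with one absolute filename per row. The following is a minimal sketch of producing such a file with the standard library. The helper name `write_file_list` and the chosen extension set are illustrative assumptions, not part of fastdup.

```python
import os

# Subset of the extensions listed above (an assumption for illustration; adjust as needed).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def write_file_list(root_dir, list_path, extensions=IMAGE_EXTENSIONS):
    """Walk root_dir recursively and write one absolute image path per row."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            # Compare extensions case-insensitively so .JPG and .jpg both match.
            if os.path.splitext(name)[1].lower() in extensions:
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    paths.sort()  # deterministic order, so repeated runs list files identically
    with open(list_path, "w") as f:
        for p in paths:
            f.write(p + "\n")
    return paths
```

The path of the written file can then be given as input_dir.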
work_dir str

Path for storing intermediate files and results.

'.'
test_dir str

Optional path for test data. When given similarity of train and test images is compared (vs. train/train or test/test which are not performed).
The following options are supported.
* test_dir can be a local folder path
* An s3:// or minio:// path.
* A python list with absolute filenames
* A file containing absolute filenames each on its own row.

''
compute str

Compute type [cpu|gpu] Note: gpu is supported only in the enterprise version.

'cpu'
verbose boolean

Verbosity.

False
num_threads int

Number of threads. If no value is specified num threads is auto configured by the number of cores.

-1
num_images unsigned long long

Number of images to run on. By default, run on all the images in the input_dir folder.

0
turi_param str

Optional turi parameters separated by commas. Example run: turi_param='nnmodel=0,ccthreshold=0.99'
The following parameters are supported.
* nnmodel=xx, Nearest Neighbor model for clustering the features together. Supported options are 0 = brute_force (exact), 1 = ball_tree and 2 = lsh (both approximate).
* ccthreshold=xx, Threshold for running connected components to find clusters of similar images. Allowed values 0->1. The default ccthreshold is 0.96. This groups very similar images together, for example identical images or images that went through
simple transformations like scaling, flip, zoom in. The higher the score, the more similar the grouped images are and the smaller the clusters you will get. A score of 0.9 is pretty broad and will cluster images together even if their fine details are not similar. It is recommended to experiment with this parameter based on your dataset and then visualize the results using fastdup.create_components_gallery().
* run_cc=0|1 run connected components on the resulting similarity graph. Default is 1.
* run_pagerank=0|1 run pagerank on the resulting similarity graph. Default is 1.
* delete_tar=0|1 when working with tar files obtained from cloud storage delete the tar after download
* delete_img=0|1 when working with images obtained from cloud storage delete the image after download
* tar_only=0|1 run only on tar files and ignore images in folders. Default is 0.
* run_stats=0|1 compute image statistics. Default is 1.
* sync_s3_to_local=0|1 In case of using s3 bucket sync s3 to local folder to improve performance. Assumes there is enough local disk space to contain the data. Default is 0.

'nnmodel=0'
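Since turi_param is a single comma-separated string of key=value pairs, it can be convenient to assemble it from a dictionary. The sketch below is one way to do that. `build_turi_param` is a hypothetical helper, not a fastdup function, and it does not check that the keys are among the supported parameters.

```python
def build_turi_param(options):
    """Join a dict of options into a comma-separated key=value string."""
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = int(value)  # flags are documented as 0|1
        parts.append(f"{key}={value}")
    return ",".join(parts)


turi_param = build_turi_param(
    {"nnmodel": 0, "ccthreshold": 0.96, "run_pagerank": False}
)
print(turi_param)  # nnmodel=0,ccthreshold=0.96,run_pagerank=0
```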
distance str

Distance metric for the Nearest Neighbors algorithm. The default is 'cosine' which works well in most cases.
For nn_provider='nnf' the following distance metrics are supported.
When using nnf_mode='Flat': 'cosine', 'euclidean', 'l1','linf','canberra','braycurtis','jensenshannon' are supported.
Otherwise 'cosine' and 'euclidean' are supported.

'cosine'
threshold float

Similarity measure in the range 0->1, where 1 is totally identical, 0.98 and above is almost identical.

0.9
lower_threshold float

Similarity percentile used to flag images that are far away (outliers) from the total distribution. The default of 0.05 corresponds to 5% of the total similarities computed.

0.05
model_path str

Optional location of ONNX model file, should not be used.

model_path_full
version bool

Print out the version number. This function takes no argument.

False
nearest_neighbors_k int

For each image, how many similar images to look for.

2
d int

Length of the feature vector. By default it is 576. When you use your own onnx model, change this parameter to the output feature vector length of that model.

576
run_mode int

run_mode=0 (the default) does the feature extraction and NN embedding to compute all pairs similarities.
It uses the input_dir argument for finding the directory to run on (or a list of files to run on).
The features are extracted and saved into the work_dir path (the default features output file name is
features.dat in the same folder for storing the numpy features and features.dat.csv for storing the
image file names corresponding to the numpy features).
For larger datasets it may be wise to split the run into two, to make sure intermediate results are stored in case you encounter an error.

run_mode=1 computes the extracted features and stores them; it does not compute the NN embedding.
For large datasets, it is possible to run on a few computing nodes, to extract the features, in parallel.
Use the min_offset and max_offset flags to allocate a subset of the images for each computing node.
Offsets start from 0 to n-1 where n is the number of images in the input_dir folder.

run_mode=2 reads a stored feature file and computes the NN embedding to provide similarities.
The input_dir param is ignored, and the work_dir is used to point to the numpy feature file. (Give a full path and filename).

run_mode=3 reads the NN model stored by nnf.index from the work_dir and computes all pairs similarity on all images
given by the test_dir parameter. input_dir should point to the location of the train data.
This mode is used for scoring similarities on a new test dataset given a precomputed similarity index on a train dataset.

run_mode=4 reads the NN model stored by nnf.index from the work_dir and computes all pairs similarity on pre-extracted feature vectors computed by run_mode=1.

0
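For the multi-node feature extraction described under run_mode=1, each node needs its own min_offset/max_offset pair. The sketch below splits n images into contiguous ranges. `split_offsets` is a hypothetical helper, and treating max_offset as exclusive is an assumption that should be verified against your fastdup version.

```python
def split_offsets(num_images, num_nodes):
    """Split range(num_images) into contiguous (min_offset, max_offset) pairs.

    Assumption: max_offset is exclusive, so consecutive ranges share a boundary.
    """
    if num_images < 0 or num_nodes < 1:
        raise ValueError("num_images must be >= 0 and num_nodes >= 1")
    base, extra = divmod(num_images, num_nodes)
    ranges = []
    start = 0
    for node in range(num_nodes):
        # The first `extra` nodes take one additional image each.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


print(split_offsets(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Each pair would then be passed to one node together with run_mode=1, and the resulting features combined in a later run.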
nn_provider string

Provider of the nearest neighbor algorithm, allowed values are nnf.

'nnf'
min_offset unsigned long long

Optional min offset to start iterating on the full file list.

0
max_offset unsigned long long

Optional max offset to stop iterating on the full file list.

0
nnf_mode str

When nn_provider='nnf', selects the nnf model mode.
The default is HNSW32. Flat is more accurate.

'HNSW32'
nnf_param str

When nn_provider='nnf', assigns optional parameters.
* num_em_iter=XX, number of KMeans EM iterations to run. Default is 20.
* num_clusters=XX, number of KMeans clusters to use. Default is 100.

''
bounding_box str

Optional bounding box to crop images, given as bounding_box='row_y=xx,col_x=xx,height=xx,width=xx'. This defines a global bounding box to be used for all images.
Beta release features (need to sign up at https://visual-layer.com): It is possible to set bounding_box='face' to crop the face from the image (in case a face is present).
In addition, you can set bounding_box='yolov5s' and we will run yolov5s to create and crop bounding boxes on your data. (We do not host this model, it is downloaded from the relevant github project).
For the face/yolov5 crop the margin around the face is defined by turi_param='augmentation_horiz=0.2,augmentation_vert=0.2' where 0.2 means 20% additional margin around the face relative to the width and height respectively.
It is possible to change the margin, the lowest value is 0 (no margin) and upper allowed value is 1. Default is 0.2.

''
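The global bounding box is passed as a formatted string, so a small helper can reduce formatting mistakes. This is a minimal sketch. `build_bounding_box` is a hypothetical name, and the validation rules (non-negative origin, positive size) are assumptions rather than documented constraints.

```python
def build_bounding_box(row_y, col_x, height, width):
    """Format a global crop box as 'row_y=..,col_x=..,height=..,width=..'."""
    if row_y < 0 or col_x < 0:
        raise ValueError("row_y and col_x must be non-negative")
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    return f"row_y={row_y},col_x={col_x},height={height},width={width}"


print(build_bounding_box(10, 20, 100, 200))
# row_y=10,col_x=20,height=100,width=200
```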
batch_size int

Optional batch size when computing inference. Allowed values < 200. Note: batch_size > 1 is enabled in the enterprise version.

1
resume int

Optional flag to resume from a previous run.

0
high_accuracy bool

Compute a more accurate model. Runtime is increased by about 15% and feature vector storage size/memory is increased by about 60%. The upside is that the model can better distinguish minute details in images with many objects.

False

Returns:

Name Type Description
ret int

Status code 0 = success, 1 = error.
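Putting the parameters together, a call might be prepared as in the sketch below. `make_run_kwargs` and `run_fastdup` are hypothetical wrappers written for illustration. Only `fastdup.run`, its keyword names, and the 0/1 status code come from the documentation above.

```python
def make_run_kwargs(input_dir, work_dir, threshold=0.9, lower_threshold=0.05):
    """Check the documented 0->1 ranges and return keyword arguments for fastdup.run."""
    if not input_dir:
        raise ValueError("input_dir is the only mandatory argument")
    for name, value in (("threshold", threshold),
                        ("lower_threshold", lower_threshold)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in the range 0->1, got {value}")
    return {
        "input_dir": input_dir,
        "work_dir": work_dir,
        "threshold": threshold,
        "lower_threshold": lower_threshold,
    }


def run_fastdup(kwargs):
    """Call fastdup.run and map the status code (0 = success) to a boolean."""
    import fastdup  # imported lazily so make_run_kwargs works without fastdup installed

    return fastdup.run(**kwargs) == 0


# Example with placeholder paths; run_fastdup(kwargs) would start the actual run.
kwargs = make_run_kwargs("/path/to/images", "/path/to/work_dir")
```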

Source code in fastdup/__init__.py
def run(input_dir='',             
        work_dir='.', 
        test_dir='',
        compute='cpu',    
        verbose=False,     
        num_threads=-1,     
        num_images=0,        
        turi_param='nnmodel=0',  
        distance='cosine',     #distance metric for the nearest neighbor model.
        threshold=0.9,         #threshold for finding similar images. (allowed values 0->1)
        lower_threshold=0.05,   #lower percentile threshold for finding similar images (values 0->1)
        model_path=model_path_full,
        license='',            #license string
        version=False,          #show version and exit      
        nearest_neighbors_k=2, 
        d=576,
        run_mode=0,
        nn_provider='nnf',
        min_offset=0,
        max_offset=0, 
        nnf_mode="HNSW32",
        nnf_param="",
        bounding_box="",
        batch_size = 1,
        resume = 0,
        high_accuracy=False):

    '''
    Run fastdup tool for finding duplicate, near duplicates, outliers and clusters of related images in a corpus of images.
    The only mandatory argument is input_dir.

    Args:
        input_dir (str):
            Location of the images/videos to analyze.
                * A folder
                * A remote folder (s3 or minio starting with s3:// or minio://). When using minio append the minio server name for example minio://google/visual_db/sku110k.
                * A file containing absolute filenames each on its own row.
                * A file containing s3 full paths or minio paths each on its own row.
                * A python list with absolute filenames.
                * A python list with absolute folders; all images and videos in those folders are added recursively.
                * For run_mode=2, a folder containing fastdup binary features or a file containing list of atrain_feature.dat.csv files in multiple folders
                * yolov5 yaml input file containing train and test folders (single folder supported for now)
                * We support jpg, jpeg, tiff, tif, giff, heif, heic, bmp, png, mp4, avi. In addition we support tar, tar.gz, tgz and zip files containing images.
            If you have other image extensions that are readable by opencv imread() you can give them in a file (each image on its own row) and then we do not check for the
            known extensions and use opencv to read those formats.
            Note: It is not possible to mix compressed (videos or tars/zips) and regular images. Use the flag turi_param='tar_only=1' if you want to ignore images and run from compressed files.
            Note: We assume image sizes should be larger or equal to 10x10 pixels. Smaller images (either on width or on height) will be ignored with a warning shown.
            Note: It is possible to skip small images also by defining minimum allowed file size using turi_param='min_file_size=1000' (in bytes).
            Note: For performance reasons it is always preferred to copy s3 images from s3 to local disk and then run fastdup on local disk, since copying images from s3 in a loop is very slow.
            Alternatively you can use the flag turi_param='sync_s3_to_local=1' to copy ahead all images on the remote s3 bucket to disk.

        work_dir (str): Path for storing intermediate files and results.

        test_dir (str): Optional path for test data. When given similarity of train and test images is compared (vs. train/train or test/test which are not performed).
            The following options are supported.
                * test_dir can be a local folder path
                * An s3:// or minio:// path.
                * A python list with absolute filenames
                * A file containing absolute filenames each on its own row.

        compute (str): Compute type [cpu|gpu] Note: gpu is supported only in the enterprise version.

        verbose (boolean): Verbosity.

        num_threads (int): Number of threads. If no value is specified num threads is auto configured by the number of cores.

        num_images (unsigned long long): Number of images to run on. By default, run on all the images in the input_dir folder.

        turi_param (str): Optional turi parameters separated by commas. Example run: turi_param='nnmodel=0,ccthreshold=0.99'
            The following parameters are supported.
                * nnmodel=xx, Nearest Neighbor model for clustering the features together. Supported options are 0 = brute_force (exact), 1 = ball_tree and 2 = lsh (both approximate).
                * ccthreshold=xx, Threshold for running connected components to find clusters of similar images. Allowed values 0->1. The default ccthreshold is 0.96. This groups very similar images together, for example identical images or images that went through
                    simple transformations like scaling, flip, zoom in. The higher the score, the more similar the grouped images are and the smaller the clusters you will get. \
                    A score of 0.9 is pretty broad and will cluster images together even if their fine details are not similar. \
                    It is recommended to experiment with this parameter based on your dataset and then visualize the results using `fastdup.create_components_gallery()`.
                * run_cc=0|1 run connected components on the resulting similarity graph. Default is 1.
                * run_pagerank=0|1 run pagerank on the resulting similarity graph. Default is 1.
                * delete_tar=0|1 when working with tar files obtained from cloud storage delete the tar after download
                * delete_img=0|1 when working with images obtained from cloud storage delete the image after download
                * tar_only=0|1 run only on tar files and ignore images in folders. Default is 0.
                * run_stats=0|1 compute image statistics. Default is 1.
                * sync_s3_to_local=0|1 In case of using s3 bucket sync s3 to local folder to improve performance. Assumes there is enough local disk space to contain the data. Default is 0.\


        distance (str): Distance metric for the Nearest Neighbors algorithm. The default is 'cosine' which works well in most cases.
            For nn_provider='nnf' the following distance metrics are supported.
            When using nnf_mode='Flat': 'cosine', 'euclidean', 'l1','linf','canberra','braycurtis','jensenshannon' are supported.
            Otherwise 'cosine' and 'euclidean' are supported.

        threshold (float): Similarity measure in the range 0->1, where 1 is totally identical, 0.98 and above is almost identical.

        lower_threshold (float): Similarity percentile used to flag images that are far away (outliers) from the total distribution. The default of 0.05 corresponds to 5% of the total similarities computed.

        model_path (str): Optional location of ONNX model file, should not be used.

        version (bool): Print out the version number. This function takes no argument.

        nearest_neighbors_k (int): For each image, how many similar images to look for.

        d (int): Length of the feature vector. By default it is 576. When you use your own onnx model, change this parameter to the output feature vector length of that model.

        run_mode (int):
            ==run_mode=0== (the default) does the feature extraction and NN embedding to compute all pairs similarities.
            It uses the input_dir command line argument for finding the directory to run on (or a list of files to run on).
            The features are extracted and saved into the work_dir path (the default features output file name is
            `features.dat` in the same folder for storing the numpy features and features.dat.csv for storing the
            image file names corresponding to the numpy features).
            For larger dataset it may be wise to split the run into two, to make sure intermediate results are stored in case you encounter an error.\
            \
            ==run_mode=1== computes the extracted features and stores them, does not compute the NN embedding.
            For large datasets, it is possible to run on a few computing nodes, to extract the features, in parallel.
            Use the min_offset and max_offset flags to allocate a subset of the images for each computing node.
            Offsets start from 0 to n-1 where n is the number of images in the input_dir folder.\
            \
            ==run_mode=2== reads a stored feature file and computes the NN embedding to provide similarities.
            The input_dir param is ignored, and the work_dir is used to point to the numpy feature file. (Give a full path and filename).\
            \
            ==run_mode=3== Reads the NN model stored by nnf.index from the work_dir and computes all pairs similarity on all images
            given by the test_dir parameter. input_dir should point to the location of the train data.
            This mode is used for scoring similarities on a new test dataset given a precomputed similarity index on a train dataset.\
            \
            ==run_mode=4== reads the NN model stored by `nnf.index` from the `work_dir` and computes all pairs similarity on pre-extracted feature vectors computed by run_mode=1.\

        nn_provider (string): Provider of the nearest neighbor algorithm, allowed values are nnf.

        min_offset (unsigned long long): Optional min offset to start iterating on the full file list.

        max_offset (unsigned long long): Optional max offset to stop iterating on the full file list.

        nnf_mode (str): When nn_provider='nnf', selects the nnf model mode.
            The default is HNSW32. Flat is more accurate.

        nnf_param (str): When nn_provider='nnf', assigns optional parameters.
            ==num_em_iter=XX==, number of KMeans EM iterations to run. Default is 20.\
            ==num_clusters=XX==, number of KMeans clusters to use. Default is 100.\

        bounding_box (str): Optional bounding box to crop images, given as bounding_box='row_y=xx,col_x=xx,height=xx,width=xx'. This defines a global bounding box to be used for all images.
            Beta release features (need to sign up at https://visual-layer.com): It is possible to set bounding_box='face' to crop the face from the image (in case a face is present).
            In addition, you can set bounding_box='yolov5s' and we will run yolov5s to create and crop bounding boxes on your data. (We do not host this model, it is downloaded from the relevant github project).
            For the face/yolov5 crop the margin around the face is defined by turi_param='augmentation_horiz=0.2,augmentation_vert=0.2' where 0.2 means 20% additional margin around the face relative to the width and height respectively.
            It is possible to change the margin, the lowest value is 0 (no margin) and upper allowed value is 1. Default is 0.2.

        batch_size (int): Optional batch size when computing inference. Allowed values < 200. Note: batch_size > 1 is enabled in the enterprise version.

        resume (int): Optional flag to resume from a previous run.

        high_accuracy (bool): Compute a more accurate model. Runtime is increased by about 15% and feature vector storage size/memory is increased by about 60%. The upside is that the model can better distinguish minute details in images with many objects.

    Returns:
        ret (int): Status code 0 = success, 1 = error.

    '''
    fastdup_capture_log_debug_state(locals())

    _input_dir = input_dir
    fd_model = False
    if bounding_box == 'face' or bounding_box == 'yolov5s':
        local_model = os.path.join(LOCAL_DIR, 'UndisclosedFDModel.onnx')
        if bounding_box == 'yolov5s':
            local_model = find_model(YOLOV5S_MODEL)

        bounding_box = ''
        turi_param += ",save_crops=1"
        if 'augmentation_horiz' not in turi_param and 'augmentation_vert' not in turi_param:
            turi_param += ",augmentation_horiz=0.2,augmentation_vert=0.2"
        ret = do_run(input_dir=input_dir,
                     work_dir=work_dir,
                     test_dir=test_dir,
                     compute=compute,
                     verbose=verbose,
                     num_threads=num_threads,
                     num_images=num_images,
                     turi_param=turi_param,
                     distance=distance,
                     threshold=threshold,
                     lower_threshold=lower_threshold,
                     model_path=local_model,
                     license=license,
                     version=version,
                     nearest_neighbors_k=nearest_neighbors_k,
                     d=d,
                     run_mode=1,
                     nn_provider=nn_provider,
                     min_offset=min_offset,
                     max_offset=max_offset,
                     nnf_mode=nnf_mode,
                     nnf_param=nnf_param,
                     bounding_box=bounding_box,
                     batch_size = batch_size,
                     resume = resume,
                     high_accuracy=high_accuracy)
        if (ret != 0):
            print("Failed to run fastdup")
            return ret
        try:
            os.unlink(os.path.join(work_dir, 'atrain_features.dat'))
        except:
            pass
        input_dir = os.path.join(work_dir, "crops")

    ret = do_run(input_dir=input_dir,
             work_dir=work_dir,
             test_dir=test_dir,
             compute=compute,
             verbose=verbose,
             num_threads=num_threads,
             num_images=num_images,
             turi_param=turi_param if not fd_model else turi_param.replace(',save_crops=1','').replace('save_crops=1',''),
             distance=distance,
             threshold=threshold,
             lower_threshold=lower_threshold,
             model_path=model_path,
             license=license,
             version=version,
             nearest_neighbors_k=nearest_neighbors_k,
             d=d,
             run_mode=run_mode,
             nn_provider=nn_provider,
             min_offset=min_offset,
             max_offset=max_offset,
             nnf_mode=nnf_mode,
             nnf_param=nnf_param,
             bounding_box=bounding_box,
             batch_size = batch_size,
             resume = resume,
             high_accuracy=high_accuracy)
    return ret

fastdup.run_kmeans

Run KMeans algorithm on a folder of images given by input_dir and save the results to work_dir.
Fastdup will extract feature vectors using the model specified by model_path and then run KMeans to cluster the vectors.
The results will be saved to work_dir in the following format:
- kmeans_centroids.csv: a csv file containing the centroids of the clusters.
- kmeans_assignments.csv: assignment of each data point to the closest centroids (number of centroids given by nearest_neighbors_k).
After running kmeans you can use create_kmeans_clusters_gallery to view the results.
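To illustrate what assigning a data point to its nearest_neighbors_k closest centroids means, here is a small pure-Python sketch using L2 distance. It is an illustration of the idea under that assumption, not fastdup's implementation, and `nearest_centroids` is a hypothetical name.

```python
import math


def nearest_centroids(vector, centroids, k=2):
    """Return [(centroid_id, l2_distance), ...] for the k closest centroids."""
    distances = [(idx, math.dist(vector, c)) for idx, c in enumerate(centroids)]
    distances.sort(key=lambda pair: pair[1])  # closest centroid first
    return distances[:k]


centroids = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
print(nearest_centroids((0.0, 0.0), centroids, k=2))  # [(0, 0.0), (1, 5.0)]
```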

Parameters:

Name Type Description Default
input_dir str

path to the folder containing the images to be clustered. See fastdup.run for more details.

''
work_dir str

path to the folder where the results will be saved.

'.'
verbose bool

verbosity level, default False

False
num_clusters int

Number of KMeans clusters to use

100
num_em_iter int

Number of em iterations

20
num_threads int

Number of threads for performing the feature vector extraction

-1
num_images int

Limit the number of images

0
model_path str

Model path for the model to be used for feature vector extraction

model_path_full
license str

License string

''
nearest_neighbors_k int

When assigning an image into a cluster, how many clusters to assign to (starting from the closest)

2
d int

Dimension of the feature vector

576
bounding_box str

Optional bounding box see fastdup:::run for more details

''
high_accuracy bool

Use higher accuracy model for the feature extraction

False

Returns:

Name Type Description
ret int

0 in case of success, 1 in case of error

Source code in fastdup/__init__.py
def run_kmeans(input_dir='',
               work_dir='.',
               verbose=False,
               num_clusters=100,
               num_em_iter=20,
               num_threads=-1,
               num_images=0,
               model_path=model_path_full,
               license='',            #license string
               nearest_neighbors_k=2,
               d=576,
               bounding_box="",
               high_accuracy=False):
    """
    Run KMeans algorithm on a folder of images given by `input_dir` and save the results to `work_dir`.
    Fastdup will extract feature vectors using the model specified by `model_path` and then run KMeans to cluster the vectors.
    The results will be saved to `work_dir` in the following format:
    - `kmeans_centroids.csv`: a csv file containing the centroids of the clusters.
    - `kmeans_assignments.csv`: assignment of each data point to the closest centroids (number of centroids given by `nearest_neighbors_k`).
    After running kmeans you can use `create_kmeans_clusters_gallery` to view the results.

    Args:
        input_dir (str): path to the folder containing the images to be clustered. See `fastdup.run` for more details.
        work_dir (str): path to the folder where the results will be saved.
        verbose (bool): verbosity level, default False
        num_clusters (int): Number of KMeans clusters to use
        num_em_iter (int): Number of em iterations
        num_threads (int): Number of threads for performing the feature vector extraction
        num_images (int): Limit the number of images
        model_path (str): Model path for the model to be used for feature vector extraction
        license (str): License string
        nearest_neighbors_k (int): When assigning an image into a cluster, how many clusters to assign to (starting from the closest)
        d (int): Dimension of the feature vector
        bounding_box (str): Optional bounding box see fastdup:::run for more details
        high_accuracy (bool): Use higher accuracy model for the feature extraction

    Returns:
        ret (int): 0 in case of success, 1 in case of error
    """
    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        assert num_clusters >= 2, "Number of clusters must be at least 2, got {}".format(num_clusters)
        assert num_em_iter >=1, "Number of EM iterations must be at least 1, got {}".format(num_em_iter)

        ret = run(input_dir=input_dir,
                work_dir=work_dir,
                verbose=verbose,
                num_threads=num_threads,
                num_images=num_images,
                model_path=model_path,
                license=license,            #license string
                nearest_neighbors_k=nearest_neighbors_k,
                d=d,
                run_mode=5,
                nnf_param=f"num_clusters={num_clusters},num_em_iter={num_em_iter}",
                bounding_box=bounding_box,
                high_accuracy=high_accuracy)
        fastdup_performance_capture("run_kmeans", start_time)
        return ret

    except Exception as ex:
        fastdup_capture_exception("run_kmeans", ex)

fastdup.run_kmeans_on_extracted

Run KMeans algorithm on a folder of extracted feature vectors (created by default when running fastdup:::run).
The results will be saved to work_dir in the following format:
- kmeans_centroids.csv: a csv file containing the centroids of the clusters. In each row one centroid. In total num_clusters rows.
- kmeans_assignments.csv: assignment of each data point to the closest centroids (number of centroids given by nearest_neighbors_k). In each row the image filename is listed, followed by the centroid id (starting from zero) and the L2 distance to the centroid.
After running kmeans you can use fastdup:::create_kmeans_clusters_gallery to view the results.
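The assignments file described above can be grouped by centroid for quick inspection. The sketch below assumes each row holds filename, centroid id and L2 distance in that order, and skips any row that does not parse (for example a header line). The exact column names and header layout are not stated here, so treat this as an assumption. `group_by_centroid` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict


def group_by_centroid(csv_text):
    """Map centroid id -> list of (filename, distance), closest first."""
    clusters = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue
        try:
            centroid_id = int(row[1])
            distance = float(row[2])
        except ValueError:
            continue  # header line or malformed row
        clusters[centroid_id].append((row[0], distance))
    for members in clusters.values():
        members.sort(key=lambda m: m[1])
    return dict(clusters)


# Made-up sample rows for illustration only.
sample = "filename,centroid,distance\n/a.jpg,0,0.5\n/b.jpg,1,0.2\n/c.jpg,0,0.1\n"
print(group_by_centroid(sample))
```

To use it on real output, read the text of kmeans_assignments.csv from work_dir and pass it in.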

Parameters:

Name Type Description Default
input_dir str

path to the folder containing the images to be clustered. See fastdup:::run for more details.

''
work_dir str

path to the folder where the results will be saved.

'.'
verbose bool

verbosity level, default False

False
num_clusters int

Number of KMeans clusters to use

100
num_em_iter int

Number of em iterations

20
num_threads int

Number of threads for performing the feature vector extraction

-1
num_images int

Limit the number of images

0
model_path str

Model path for the model to be used for feature vector extraction

model_path_full
license str

License string

''
nearest_neighbors_k int

When assigning an image into a cluster, how many clusters to assign to (starting from the closest)

2
d int

Dimension of the feature vector

576

Returns:

Name Type Description
ret int

0 in case of success, 1 in case of error

Source code in fastdup/__init__.py
def run_kmeans_on_extracted(input_dir='',
               work_dir='.',
               verbose=False,
               num_clusters=100,
               num_em_iter=20,
               num_threads=-1,
               num_images=0,
               model_path=model_path_full,
               license='',            #license string
               nearest_neighbors_k=2,
               d=576):
    """
    Run the KMeans algorithm on a folder of extracted feature vectors (created by default when running fastdup:::run).
    The results are saved to `work_dir` in the following format:
    - `kmeans_centroids.csv`: a csv file containing the centroids of the clusters, one centroid per row, `num_clusters` rows in total.
    - `kmeans_assignments.csv`: the assignment of each data point to its closest centroids (the number of centroids is given by `nearest_neighbors_k`). Each row lists the image filename, the centroid id (starting from zero) and the L2 distance to the centroid.
    After running kmeans you can use `fastdup:::create_kmeans_clusters_gallery` to view the results.

    Args:
        input_dir (str): path to the folder containing the images to be clustered. See fastdup:::run for more details.
        work_dir (str): path to the folder where the results will be saved.
        verbose (bool): verbosity level, default False
        num_clusters (int): Number of KMeans clusters to use
        num_em_iter (int): Number of em iterations
        num_threads (int): Number of threads for performing the feature vector extraction
        num_images (int): Limit the number of images
        model_path (str): Model path for the model to be used for feature vector extraction
        license (str): License string
        nearest_neighbors_k (int): When assigning an image into a cluster, how many clusters to assign to (starting from the closest)
        d (int): Dimension of the feature vector

    Returns:
        ret (int): 0 in case of success, 1 in case of error
    """

    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        assert num_clusters >= 2, "Number of clusters must be at least 2, got {}".format(num_clusters)
        assert num_em_iter >=1, "Number of EM iterations must be at least 1, got {}".format(num_em_iter)

        ret = run(input_dir=input_dir,
                   work_dir=work_dir,
                   verbose=verbose,
                   num_threads=num_threads,
                   num_images=num_images,
                   model_path=model_path,
                   license=license,            #license string
                   nearest_neighbors_k=nearest_neighbors_k,
                   d=d,
                   run_mode=6,
                   nnf_param=f"num_clusters={num_clusters},num_em_iter={num_em_iter}")
        fastdup_performance_capture("run_kmeans_on_extracted", start_time)
        return ret
    except Exception as ex:
        fastdup_capture_exception("run_kmeans_on_extracted", ex)

Fastdup visualization of results

Visualization of the output data is done using the following functions:

fastdup.create_duplicates_gallery

Function to create and display a gallery of duplicate/near duplicate images as computed by the similarity metric.

In addition, it is possible to compute a hierarchical gallery of duplicate/near-duplicate clusters. To do so:
(A) Run fastdup to compute similarity on work_dir.
(B) Run connected components on the work_dir, saving the component results to save_path (this needs to run with lazy_load=True).
(C) Run create_duplicates_gallery() on the components to find pairs of similar components. Point similarity_file to the similarity_hierarchical_XX.csv file, where XX is the connected components threshold (ccthreshold=XX).

Example

import fastdup
fastdup.run('input_folder', 'output_folder')
fastdup.create_duplicates_gallery('output_folder', save_path='.', get_label_func = lambda x: x.split('/')[1], slice='hamburger')

Regarding get_label_func, this example assumes that the second path component is the class name, for example my_data/hamburger/image001.jpg. You can change it to match your own labeling convention.
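As a rough illustration of the labeling convention, the sketch below compares the lambda from the example with a variant that reads the parent folder name. Both function names are hypothetical, and the paths are made up. The point is that split('/')[1] depends on how many components precede the class folder, so it returns a different component for an absolute path.

```python
from pathlib import PurePosixPath

def label_from_second_component(path):
    """Same logic as the lambda in the example: second '/'-separated part."""
    return path.split('/')[1]

def label_from_parent_folder(path):
    """Alternative convention: the folder that directly contains the image."""
    return PurePosixPath(path).parent.name

relative_path = 'my_data/hamburger/image001.jpg'
absolute_path = '/data/my_data/hamburger/image001.jpg'

relative_label = label_from_second_component(relative_path)
absolute_label = label_from_second_component(absolute_path)
parent_label = label_from_parent_folder(absolute_path)
```

Here relative_label and parent_label are both 'hamburger', while absolute_label is 'data'. Since the parameter description says the function receives an absolute path, a parent-folder rule may be the safer choice when the class is the immediate folder.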

Parameters:

Name Type Description Default
similarity_file str

csv file with the computed similarities by the fastdup tool, or a work_dir path, or a pandas dataframe containing the similarities.

required
save_path str

output folder location for the visuals

required
num_images int

Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

20
descending boolean

If False, print the similarities from the least similar to the most similar. Default is True.

True
lazy_load boolean

If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

False
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.

None
slice str

Optional parameter to select a slice of the similarity file based on a specific label or a list of labels.
slice could be a specific label, e.g. slice='hamburger', in which case only similarities between hamburger and other classes are presented.
Two reserved arguments for slice are "diff" and "same". When using "diff" the report only shows similarities between classes. When using "same" the report will show only similarities inside the same class.
Note that when using slice, get_label_func should be implemented.

None
max_width int

Optional parameter to set the max width of the gallery.

None
get_bounding_box_func callable

Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

None
get_reformat_filename_func callable

Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.

None
get_extra_col_func callable

Optional parameter to allow adding additional column to the report

None
input_dir str

Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

None
work_dir str

Optional parameter to specify fastdup work_dir, when using a pd.DataFrame instead of a duplicate file path

None
threshold float

Optional parameter to specify the threshold for similarity score to be considered as duplicate. Values above the threshold will be considered as duplicate.
Allowed values are between 0 and 1.

None
save_artifacts boolean

Optional parameter to allow saving the intermediate artifacts (raw images, csv with results) to the output folder

required
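The slice options described in the table can be sketched in plain Python. This is an illustration of the documented semantics, not fastdup's code: filter_by_slice is a hypothetical helper, rows are modeled as (from, to, score) tuples, and treating a plain label as "the from image has this label" is an assumption about how the label case is applied.

```python
def filter_by_slice(rows, get_label_func, slice_value):
    """Filter (from, to, score) rows following the documented slice options."""
    selected = []
    for src, dst, score in rows:
        src_label = get_label_func(src)
        dst_label = get_label_func(dst)
        if slice_value == "diff":
            keep = src_label != dst_label  # pairs across classes only
        elif slice_value == "same":
            keep = src_label == dst_label  # pairs inside one class only
        else:
            # Assumption: a plain label selects rows whose "from" image has it.
            keep = src_label == slice_value
        if keep:
            selected.append((src, dst, score))
    return selected

pairs = [
    ("a/hamburger/1.jpg", "a/hamburger/2.jpg", 0.99),
    ("a/hamburger/1.jpg", "a/pizza/7.jpg", 0.91),
    ("a/pizza/7.jpg", "a/pizza/8.jpg", 0.95),
]
second_folder = lambda path: path.split('/')[1]

diff_pairs = filter_by_slice(pairs, second_folder, "diff")
same_pairs = filter_by_slice(pairs, second_folder, "same")
hamburger_pairs = filter_by_slice(pairs, second_folder, "hamburger")
```

The sketch also shows why a label function is needed whenever slice is used: without labels there is nothing to compare.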
Source code in fastdup/__init__.py
def create_duplicates_gallery(similarity_file, save_path, num_images=20, descending=True,
                              lazy_load=False, get_label_func=None, slice=None, max_width=None,
                              get_bounding_box_func=None, get_reformat_filename_func=None, get_extra_col_func=None,
                              input_dir=None, work_dir=None, threshold=None, **kwargs):
    '''

    Function to create and display a gallery of duplicate/near duplicate images as computed by the similarity metric.

    In addition, it is possible to compute a hierarchical gallery of duplicate/near-duplicate clusters. To do so:
        (A) Run fastdup to compute similarity on work_dir.
        (B) Run connected components on the work_dir, saving the component results to save_path (this needs to run with lazy_load=True).
        (C) Run create_duplicates_gallery() on the components to find pairs of similar components. Point similarity_file to the similarity_hierarchical_XX.csv file, where XX is the
        connected components threshold (ccthreshold=XX).

    Example:
        >>> import fastdup
        >>> fastdup.run('input_folder', 'output_folder')
        >>> fastdup.create_duplicates_gallery('output_folder', save_path='.', get_label_func = lambda x: x.split('/')[1], slice='hamburger')

    Regarding get_label_func, this example assumes that the second folder name is the class name for example my_data/hamburger/image001.jpg. You can change it to match your own labeling convention.


    Args:
        similarity_file (str): csv file with the computed similarities by the fastdup tool, or a work_dir path, or a pandas dataframe containing the similarities.

        save_path (str): output folder location for the visuals

        num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

        descending (boolean): If False, print the similarities from the least similar to the most similar. Default is True.

        lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

        get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
            Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.

        slice (str): Optional parameter to select a slice of the similarity file based on a specific label or a list of labels.
            slice could be a specific label, e.g. slice='hamburger', in which case only similarities between hamburger and other classes are presented.
            Two reserved arguments for slice are "diff" and "same". When using "diff" the report only shows similarities between classes. When using "same" the report will show only similarities inside the same class.
            Note that when using slice, get_label_func should be implemented.

        max_width (int): Optional parameter to set the max width of the gallery.

        get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
            The input is an absolute path to the image and the output is a list of bounding boxes.
            Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
            Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
            Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

        get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.
            The input is an absolute path to the image and the output is the string to display instead of the filename.

        get_extra_col_func (callable): Optional parameter to allow adding additional column to the report

        input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
            in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

        work_dir (str): Optional parameter to specify fastdup work_dir, when using a pd.DataFrame instead of a duplicate file path

        threshold (float): Optional parameter to specify the threshold for similarity score to be considered as duplicate. Values above the threshold will be considered as duplicate.
            Allowed values are between 0 and 1.

        save_artifacts (boolean): Optional parameter to allow saving the intermediate artifacts (raw images, csv with results) to the output folder

   '''
    fastdup_capture_log_debug_state(locals())

    try:
        start_time = time.time()
        ret = check_params(similarity_file, num_images, lazy_load, get_label_func, slice, save_path, max_width)
        if ret != 0:
            return ret;

        similarity_file, input_dir = load_dataframe(similarity_file, "similarity", input_dir, work_dir, kwargs, ["from", "to", "distance"])


        ret = do_create_duplicates_gallery(similarity_file, save_path, num_images, descending, lazy_load, get_label_func, slice, max_width, get_bounding_box_func,
                                            get_reformat_filename_func, get_extra_col_func, input_dir, work_dir, threshold, kwargs)
        fastdup_performance_capture("create_duplicates_gallery", start_time)
        return ret

    except Exception as ex:
        fastdup_capture_exception("create_duplicates_gallery", ex)

fastdup.create_duplicate_videos_gallery

Function to create and display a gallery of duplicate videos as computed by the similarity metric.

Example

import fastdup
fastdup.run('input_folder', 'output_folder', run_mode=1) # extract frames from videos
fastdup.run('input_folder', 'output_folder', run_mode=2) # run fastdup
fastdup.create_duplicate_videos_gallery('output_folder', save_path='.')

Parameters:

Name Type Description Default
similarity_file str

csv file with the computed similarities by the fastdup tool, or a work_dir path, or a pandas dataframe containing the similarities.

required
save_path str

output folder location for the visuals

required
num_images int

Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

20
descending boolean

If False, print the similarities from the least similar to the most similar. Default is True.

True
lazy_load boolean

If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

False
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.

None
slice str

Optional parameter to select a slice of the similarity file based on a specific label or a list of labels.
slice could be a specific label, e.g. slice='hamburger', in which case only similarities between hamburger and other classes are presented.
Two reserved arguments for slice are "diff" and "same". When using "diff" the report only shows similarities between classes. When using "same" the report will show only similarities inside the same class.
Note that when using slice, get_label_func should be implemented.

None
max_width int

Optional parameter to set the max width of the gallery.

None
get_bounding_box_func callable

Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

None
get_reformat_filename_func callable

Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.

None
get_extra_col_func callable

Optional parameter to allow adding additional column to the report

None
input_dir str

Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

None
work_dir str

Optional parameter to specify fastdup work_dir, when using a pd.DataFrame instead of a duplicate file path

None
threshold float

Optional parameter to specify the threshold for similarity score to be considered as duplicate. Values above the threshold will be considered as duplicate.
Allowed values are between 0 and 1.

None
save_artifacts boolean

Optional parameter to allow saving the intermediate artifacts (raw images, csv with results) to the output folder

required
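The table mentions two bounding box representations: a list of [x1, y1, x2, y2] boxes, and a csv with the columns index,filename,col_x,row_y,width,height. The sketch below converts the second into a dictionary of the first. The helper crops_csv_to_boxes and the sample rows are hypothetical, and it assumes (col_x, row_y) is the top-left corner of the box, which the text above does not state.

```python
import csv
import io

def crops_csv_to_boxes(csv_text):
    """Build {filename: [[x1, y1, x2, y2], ...]} from crop-style csv text.

    Assumption: (col_x, row_y) is the top-left corner, so the opposite
    corner is (col_x + width, row_y + height).
    """
    boxes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        x1 = int(row["col_x"])
        y1 = int(row["row_y"])
        x2 = x1 + int(row["width"])
        y2 = y1 + int(row["height"])
        boxes.setdefault(row["filename"], []).append([x1, y1, x2, y2])
    return boxes

sample_csv = (
    "index,filename,col_x,row_y,width,height\n"
    "0,/data/a.jpg,10,20,30,40\n"
    "1,/data/a.jpg,0,0,100,100\n"
    "2,/data/b.jpg,5,5,10,10\n"
)
bounding_boxes = crops_csv_to_boxes(sample_csv)
```

A dictionary shaped like bounding_boxes matches the "dictionary returning the bounding box list for each filename" form described for get_bounding_box_func.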
Source code in fastdup/__init__.py
def create_duplicate_videos_gallery(similarity_file, save_path, num_images=20, descending=True,
                              lazy_load=False, get_label_func=None, slice=None, max_width=None,
                              get_bounding_box_func=None, get_reformat_filename_func=None, get_extra_col_func=None, input_dir=None, work_dir=None, threshold=None, **kwargs):
    '''

    Function to create and display a gallery of duplicate videos as computed by the similarity metric.

    Example:
        >>> import fastdup
        >>> fastdup.run('input_folder', 'output_folder', run_mode=1)  # extract frames from videos
        >>> fastdup.run('input_folder', 'output_folder', run_mode=2)  # run fastdup
        >>> fastdup.create_duplicate_videos_gallery('output_folder', save_path='.')


    Args:
        similarity_file (str): csv file with the computed similarities by the fastdup tool, or a work_dir path, or a pandas dataframe containing the similarities.

        save_path (str): output folder location for the visuals

        num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

        descending (boolean): If False, print the similarities from the least similar to the most similar. Default is True.

        lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

        get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
            Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.

        slice (str): Optional parameter to select a slice of the similarity file based on a specific label or a list of labels.
            slice could be a specific label, e.g. slice='hamburger', in which case only similarities between hamburger and other classes are presented.
            Two reserved arguments for slice are "diff" and "same". When using "diff" the report only shows similarities between classes. When using "same" the report will show only similarities inside the same class.
            Note that when using slice, get_label_func should be implemented.

        max_width (int): Optional parameter to set the max width of the gallery.

        get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
            The input is an absolute path to the image and the output is a list of bounding boxes.
            Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
            Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
            Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists


        get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.
            The input is an absolute path to the image and the output is the string to display instead of the filename.

        get_extra_col_func (callable): Optional parameter to allow adding additional column to the report

        input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
            in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

        work_dir (str): Optional parameter to specify fastdup work_dir, when using a pd.DataFrame instead of a duplicate file path

        threshold (float): Optional parameter to specify the threshold for similarity score to be considered as duplicate. Values above the threshold will be considered as duplicate.
            Allowed values are between 0 and 1.

        save_artifacts (boolean): Optional parameter to allow saving the intermediate artifacts (raw images, csv with results) to the output folder

   '''

    try:
        fastdup_capture_log_debug_state(locals())
        start_time = time.time()
        ret = check_params(similarity_file, num_images, lazy_load, get_label_func, slice, save_path, max_width)
        if ret != 0:
            return ret

        if threshold:
            assert threshold >= 0 and threshold <= 1, "threshold should be between 0 and 1"

        if work_dir is None and isinstance(similarity_file, str):
            if  os.path.isdir(similarity_file):
                work_dir = similarity_file
            else:
                work_dir = os.path.dirname(os.path.abspath(similarity_file))

        df, input_dir = load_dataframe(similarity_file, "similarity", input_dir, work_dir, kwargs, ["from", "to", "distance"])
        if threshold is not None:
            df = df[df['distance'] >= threshold]

        df = remove_duplicate_video_distances(df, kwargs)
        kwargs['is_video'] = True

        ret = create_duplicates_gallery(df, save_path, num_images, descending, lazy_load, get_label_func, slice, max_width, get_bounding_box_func,
                                                get_reformat_filename_func, get_extra_col_func, input_dir, work_dir, threshold, **kwargs)
        fastdup_performance_capture("create_duplicates_gallery", start_time)
        return ret
    except Exception as ex:
        fastdup_capture_exception( "create_duplicates_gallery", ex)
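The threshold handling in the listing above keeps rows whose distance column is greater than or equal to threshold, so a row scoring exactly at the threshold is retained. A minimal stand-alone sketch of that filter follows, using plain dictionaries in place of the dataframe. The helper name filter_by_threshold and the sample rows are hypothetical, and raising ValueError instead of asserting is a choice made for this sketch only.

```python
def filter_by_threshold(rows, threshold=None):
    """Keep rows whose 'distance' is at least the threshold (inclusive).

    With threshold=None nothing is filtered, mirroring the
    'if threshold is not None' guard in the listing.
    """
    if threshold is None:
        return list(rows)
    if not 0 <= threshold <= 1:
        raise ValueError("threshold should be between 0 and 1")
    return [row for row in rows if row["distance"] >= threshold]

similarities = [
    {"from": "v1.mp4", "to": "v2.mp4", "distance": 0.95},
    {"from": "v1.mp4", "to": "v3.mp4", "distance": 0.90},
    {"from": "v2.mp4", "to": "v4.mp4", "distance": 0.50},
]

kept = filter_by_threshold(similarities, threshold=0.9)
unfiltered = filter_by_threshold(similarities)
```

Here kept contains the first two rows, including the one that scores exactly 0.90.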

fastdup.create_outliers_gallery

Function to create and display a gallery of images computed by the outliers metrics.
Outliers are computed using the fastdup tool, by embedding each image to a short feature vector, finding the top k similar neighbors
and finding images that are further away from all other images, i.e. outliers.
By default fastdup saves the outliers into a file called outliers.csv inside the work_dir folder.
It is possible to load this file using pandas to get the list of outlier images.
Note that the number of images included in the outliers file depends on the lower_threshold parameter in the fastdup run. This command line argument is a percentile,
i.e. 0.05 means the top 5% of the images that are further away from the rest of the images are considered outliers.
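To make the percentile idea concrete, here is a small stdlib-only sketch. It is not how fastdup computes outliers internally. The helper select_outliers and the score dictionary are hypothetical, and the sketch assumes each image has a single similarity score to its nearest neighbor, where a lower score means the image is further from the rest.

```python
def select_outliers(nearest_similarity, lower_threshold=0.05):
    """Return the lowest-scoring fraction of images as outliers.

    Assumption: nearest_similarity maps filename -> similarity to the
    nearest neighbor, and lower values mean "further away".
    """
    ranked = sorted(nearest_similarity.items(), key=lambda item: item[1])
    # Keep at least one image so tiny datasets still report something.
    count = max(1, int(len(ranked) * lower_threshold))
    return [name for name, _ in ranked[:count]]

# Twenty images with high similarity scores, one of which stands apart.
scores = {f"img_{i:02d}.jpg": 0.80 + 0.005 * i for i in range(20)}
scores["img_07.jpg"] = 0.12

outliers = select_outliers(scores, lower_threshold=0.05)
half = select_outliers(scores, lower_threshold=0.5)
```

With 20 images and lower_threshold=0.05 a single image is reported, and raising the value widens the selection accordingly.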

Parameters:

Name Type Description Default
outliers_file str

csv file with the computed outliers by the fastdup tool, or a work_dir path, or a pandas dataframe containing the outliers

required
save_path str

output folder location for the visuals

required
num_images int

Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

20
lazy_load boolean

If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

False
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.

None
how str

Optional outlier selection method. one = take the image that is far away from any one image (but may have other images close to it).
all = take the image that is far away from all other images. Default is one.

'one'
slice str

Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.

None
max_width int

Optional parameter to set the max width of the gallery.

None
get_bounding_box_func callable

Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

None
get_reformat_filename_func callable

Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.

None
get_extra_col_func callable

Optional parameter to allow adding additional column to the report

None
input_dir str

Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

None
work_dir str

Optional parameter to specify fastdup work_dir, when using a pd.DataFrame instead of an outliers file path

None
Source code in fastdup/__init__.py
def create_outliers_gallery(outliers_file, save_path, num_images=20, lazy_load=False, get_label_func=None,
                            how='one', slice=None, max_width=None, get_bounding_box_func=None,
                            get_reformat_filename_func=None, get_extra_col_func=None, input_dir =None, work_dir=None, **kwargs):
    '''

    Function to create and display a gallery of images computed by the outliers metrics.
    Outliers are computed using the fastdup tool, by embedding each image to a short feature vector, finding the top k similar neighbors
    and finding images that are further away from all other images, i.e. outliers.
    By default fastdup saves the outliers into a file called `outliers.csv` inside the `work_dir` folder.
    It is possible to load this file using pandas to get the list of outlier images.
    Note that the number of images included in the outliers file depends on the `lower_threshold` parameter in the fastdup run. This command line argument is a percentile,
    i.e. 0.05 means the top 5% of the images that are further away from the rest of the images are considered outliers.

    Parameters:
        outliers_file (str): csv file with the computed outliers by the fastdup tool, or a work_dir path, or a pandas dataframe containing the outliers

        save_path (str): output folder location for the visuals

        num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

        lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

        get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
            Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.

        how (str): Optional outlier selection method. one = take the image that is far away from any one image (but may have other images close to it).
                                                      all = take the image that is far away from all other images. Default is one.

        slice (str): Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.

        max_width (int): Optional parameter to set the max width of the gallery.

         get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
            The input is an absolute path to the image and the output is a list of bounding boxes.
            Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
            Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
            Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

        get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.
            The input is an absolute path to the image and the output is the string to display instead of the filename.

        get_extra_col_func (callable): Optional parameter to allow adding additional column to the report

        input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
            in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

        work_dir (str): Optional parameter to specify fastdup work_dir, when using a pd.DataFrame instead of a outliers file path

     '''

    try:
        fastdup_capture_log_debug_state(locals())

        start_time = time.time()
        ret = check_params(outliers_file, num_images, lazy_load, get_label_func, slice, save_path, max_width)
        if ret != 0:
            return ret

        outliers_file, input_dir = load_dataframe(outliers_file, "outliers", input_dir, work_dir, kwargs, ['from', "to", "distance"])
        assert how == 'one' or how == 'all', "Wrong argument to how=[one|all]"

        ret= do_create_outliers_gallery(outliers_file, save_path, num_images, lazy_load, get_label_func, how, slice,
                                          max_width, get_bounding_box_func, get_reformat_filename_func, get_extra_col_func, input_dir, work_dir,
                                        **kwargs)
        fastdup_performance_capture("create_outliers_gallery", start_time)
        return ret


    except Exception as ex:
        fastdup_capture_exception("create_outliers_gallery", ex)
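
As a usage illustration of the get_label_func and how arguments documented above, the following minimal sketch assumes images are stored in one folder per class, so the parent folder name can serve as the label. The helper names label_from_parent_dir, build_label_dict and build_outliers_gallery, and the example paths, are hypothetical and not part of fastdup; fastdup is only imported if build_outliers_gallery is actually called.

```python
import os


def label_from_parent_dir(filename):
    """Return the parent folder name of an image path (assumed to be its class label)."""
    return os.path.basename(os.path.dirname(filename))


def build_label_dict(filenames):
    """Dictionary form of get_label_func: absolute file name -> label."""
    return {name: label_from_parent_dir(name) for name in filenames}


def build_outliers_gallery(outliers_file, save_path, how="one"):
    """Hypothetical wrapper; needs fastdup and the outliers file of a previous run."""
    import fastdup  # imported lazily so the helpers above work without fastdup

    # how must be 'one' or 'all'; the function asserts this.
    return fastdup.create_outliers_gallery(outliers_file=outliers_file,
                                           save_path=save_path,
                                           get_label_func=label_from_parent_dir,
                                           how=how)


# Hypothetical paths, used only to show the two accepted label forms.
labels = build_label_dict(["/data/cats/a.jpg", "/data/dogs/b.jpg"])
print(labels)
```

Either label_from_parent_dir (callable form) or labels (dictionary form) could be passed as get_label_func.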

fastdup.create_components_gallery

Function to create and display a gallery of images for the largest graph components

Parameters:

Name Type Description Default
work_dir str

path to fastdup work_dir, or a path to a connected components csv file. Alternatively, a dataframe with the connected_components.csv content from a previous fastdup run.

required
save_path str

output folder location for the visuals

required
num_images int

Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

20
lazy_load boolean

If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

False
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.

None
group_by str

[visual|label]. Group the report using the visual properties of the image or using the labels of the images. Default is visual.

'visual'
slice str or list

Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.

None
max_width int

Optional parameter to set the max html width of images in the gallery. Default is None.

None
max_items int

Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.

None
get_bounding_box_func callable

Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

None
get_reformat_filename_func callable

Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.

None
get_extra_col_func callable

Optional parameter to allow adding more information to the report.

None
threshold float

Optional parameter to set the threshold for choosing components. Default is None.

None
metric str

Optional parameter to set the metric to use (like blur) for choosing components. Default is None.

None
descending boolean

Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.

True
min_items int

Optional parameter to select components with min_items or more items. Default is None.

None
keyword str

Optional parameter to select components with keyword as a subset of the label. Default is None.

None
input_dir str

Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

None
kwargs dict

Optional parameter to pass additional parameters to the function.

{}
split_sentence_to_label_list boolean

Optional parameter to split the label into a list of labels. Default is False.

False
limit_labels_printed int

Optional parameter to limit the number of labels printed in the html report. Default is max_items.

required
nrows int

limit the number of read rows for debugging purposes of the report

required
save_artifacts bool

Optional param to save intermediate artifacts like image paths used for generating the component

required

Returns:

Name Type Description
ret int

0 in case of success, otherwise 1
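
The get_bounding_box_func description above expects boxes as x1, y1, x2, y2, while the csv alternative lists col_x, row_y, width, height. The sketch below shows one way to convert between the two and wrap annotations in a callable. It assumes that (col_x, row_y) is the top-left corner of the box; the names xywh_to_xyxy and make_bounding_box_func and the sample annotation are hypothetical and not part of fastdup.

```python
def xywh_to_xyxy(col_x, row_y, width, height):
    """Convert a (col_x, row_y, width, height) box to [x1, y1, x2, y2].

    Assumption: (col_x, row_y) is the top-left corner of the box.
    """
    return [int(col_x), int(row_y), int(col_x + width), int(row_y + height)]


def make_bounding_box_func(boxes_by_file):
    """Build a callable: absolute image path -> list of [x1, y1, x2, y2] boxes."""
    def get_bounding_box(filename):
        # Unannotated images yield an empty list here; whether fastdup
        # accepts an empty list is an assumption of this sketch.
        return [xywh_to_xyxy(*box) for box in boxes_by_file.get(filename, [])]
    return get_bounding_box


annotations = {"/data/img/0001.jpg": [(10, 20, 30, 40)]}  # hypothetical annotations
get_bounding_box = make_bounding_box_func(annotations)
print(get_bounding_box("/data/img/0001.jpg"))  # [[10, 20, 40, 60]]
```

The resulting callable could then be passed as get_bounding_box_func=get_bounding_box.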

Source code in fastdup/__init__.py
def create_components_gallery(work_dir, save_path, num_images=20, lazy_load=False, get_label_func=None,
                              group_by='visual', slice=None, max_width=None, max_items=None, get_bounding_box_func=None,
                              get_reformat_filename_func=None, get_extra_col_func=None, threshold=None, metric=None,
                              descending=True, min_items=None, keyword=None, input_dir=None, **kwargs):
    '''

    Function to create and display a gallery of images for the largest graph components

    Args:
        work_dir (str): path to fastdup work_dir, or a path to a connected components csv file. Alternatively, a dataframe with the connected_components.csv content from a previous fastdup run.

        save_path (str): output folder location for the visuals

        num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

        lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

        get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
            The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.

        group_by (str): [visual|label]. Group the report using the visual properties of the image or using the labels of the images. Default is visual.

        slice (str or list): Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.

        max_width (int): Optional parameter to set the max html width of images in the gallery. Default is None.

        max_items (int): Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.

        get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
            The input is an absolute path to the image and the output is a list of bounding boxes.
            Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
            Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
            Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

        get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.  The input is an absolute path to the image and the output is the string to display instead of the filename.

        get_extra_col_func (callable): Optional parameter to allow adding more information to the report.

        threshold (float): Optional parameter to set the threshold for choosing components. Default is None.

        metric (str): Optional parameter to set the metric to use (like blur) for choosing components. Default is None.

        descending (boolean): Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.

        min_items (int): Optional parameter to select components with min_items or more items. Default is None.

        keyword (str): Optional parameter to select components with keyword as a subset of the label. Default is None.

        input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
            in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

        kwargs (dict): Optional parameter to pass additional parameters to the function.

        split_sentence_to_label_list (boolean): Optional parameter to split the label into a list of labels. Default is False.

        limit_labels_printed (int): Optional parameter to limit the number of labels printed in the html report. Default is max_items.
        nrows (int): limit the number of read rows for debugging purposes of the report
        save_artifacts (bool): Optional param to save intermediate artifacts like image paths used for generating the component

    Returns:
        ret (int): 0 in case of success, otherwise 1
    '''

    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())
        ret = check_params(work_dir, num_images, lazy_load, get_label_func, slice, save_path, max_width)
        if ret != 0:
            return ret

        if max_items is not None:
            assert isinstance(max_items, int), "max items should be an integer"
            assert max_items > 0, "max_items should be > 0"

        if isinstance(work_dir, str):
            config = load_config(os.path.dirname(work_dir))
            if input_dir is None and config is not None and 'input_dir' in config:
                input_dir = config['input_dir']
        elif isinstance(work_dir, pd.DataFrame):
            assert input_dir is not None, "When passing dataframe need to point input_dir to the previous work_dir"
            assert len(work_dir), "Empty dataframe encountered"
            assert 'component_id' in work_dir.columns, "Connected components dataframe should contain 'component_id' column"
            assert '__id' in work_dir.columns or 'len' in work_dir.columns, "Connected components dataframe should contain '__id' column"
        else:
            assert False, f"Wrong work_dir type {type(work_dir)}"

        ret = do_create_components_gallery(work_dir, save_path, num_images, lazy_load, get_label_func, group_by, slice,
                                            max_width, max_items, min_items, get_bounding_box_func,
                                            get_reformat_filename_func, get_extra_col_func, threshold, metric=metric,
                                            descending=descending, keyword=keyword, comp_type="component", input_dir=input_dir, kwargs=kwargs)
        fastdup_performance_capture("create_components_gallery", start_time)
        return ret

    except Exception as ex:
        fastdup_capture_exception("create_components_gallery", ex)
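
When work_dir is a dataframe, the source above asserts that input_dir is set, that the dataframe is non-empty, and that it contains a component_id column plus either an __id or a len column. The following minimal sketch mirrors those checks on plain values (for example list(df.columns) and len(df)), so a problem can be reported before calling the gallery function. The function name check_components_frame and its messages are hypothetical and not part of fastdup.

```python
def check_components_frame(columns, num_rows, input_dir):
    """Return a list of problems that would trip the assertions shown above."""
    problems = []
    if input_dir is None:
        problems.append("input_dir is required when work_dir is a dataframe")
    if num_rows == 0:
        problems.append("dataframe is empty")
    if "component_id" not in columns:
        problems.append("missing 'component_id' column")
    if "__id" not in columns and "len" not in columns:
        problems.append("missing '__id' (or 'len') column")
    return problems


ok = check_components_frame(["component_id", "__id"], 3, "/data/images")
bad = check_components_frame(["__id"], 0, None)
print(ok)   # []
print(bad)  # three problems: input_dir, empty dataframe, component_id
```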

fastdup.create_component_videos_gallery

Function to create and display a gallery of similar videos based on the graph components

Parameters:

Name Type Description Default
work_dir str

path to fastdup work_dir

required
save_path str

output folder location for the visuals

required
num_images int

Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

20
lazy_load boolean

If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

False
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.

None
group_by str

[visual|label]. Group the report using the visual properties of the image or using the labels of the images. Default is visual.

'visual'
slice str or list

Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.

None
max_width int

Optional parameter to set the max html width of images in the gallery. Default is None.

None
max_items int

Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.

None
get_bounding_box_func callable

Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

None
get_reformat_filename_func callable

Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.

None
get_extra_col_func callable

Optional parameter to allow adding more information to the report.

None
threshold float

Optional parameter to set the threshold for choosing components. Default is None.

None
metric str

Optional parameter to set the metric to use (like blur) for choosing components. Default is None.

None
descending boolean

Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.

True
min_items int

Optional parameter to select components with min_items or more items. Default is None.

None
keyword str

Optional parameter to select components with keyword as a subset of the label. Default is None.

None
input_dir str

Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

None

Returns:

Name Type Description
ret int

0 in case of success, otherwise 1
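
The get_reformat_filename_func parameter above takes an absolute path and returns the string to display instead. As a small illustration, the sketch below keeps only the last two path components, which can make a report easier to read when frames are stored in one folder per video. The function name short_display_name, the keep_parts argument and the example path layout are hypothetical assumptions, not fastdup conventions.

```python
import os


def short_display_name(filename, keep_parts=2):
    """Keep only the last `keep_parts` components of a path for display."""
    parts = os.path.normpath(filename).split(os.sep)
    return "/".join(parts[-keep_parts:])


# Hypothetical layout: one folder per video, one file per extracted frame.
print(short_display_name("/data/videos/clip1/frame_0001.jpg"))  # clip1/frame_0001.jpg
```

The function could then be passed as get_reformat_filename_func=short_display_name.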

Source code in fastdup/__init__.py
def create_component_videos_gallery(work_dir, save_path, num_images=20, lazy_load=False, get_label_func=None,
                              group_by='visual', slice=None, max_width=None, max_items=None, get_bounding_box_func=None,
                              get_reformat_filename_func=None, get_extra_col_func=None, threshold=None, metric=None,
                              descending=True, min_items=None, keyword=None, input_dir=None, **kwargs):
    '''

    Function to create and display a gallery of similar videos based on the graph components

    Args:
        work_dir (str): path to fastdup work_dir

        save_path (str): output folder location for the visuals

        num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

        lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

        get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
            The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.

        group_by (str): [visual|label]. Group the report using the visual properties of the image or using the labels of the images. Default is visual.

        slice (str or list): Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.

        max_width (int): Optional parameter to set the max html width of images in the gallery. Default is None.

        max_items (int): Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.

        get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
            The input is an absolute path to the image and the output is a list of bounding boxes.
            Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
            Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
            Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

        get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.  The input is an absolute path to the image and the output is the string to display instead of the filename.

        get_extra_col_func (callable): Optional parameter to allow adding more information to the report.

        threshold (float): Optional parameter to set the threshold for choosing components. Default is None.

        metric (str): Optional parameter to set the metric to use (like blur) for choosing components. Default is None.

        descending (boolean): Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.

        min_items (int): Optional parameter to select components with min_items or more items. Default is None.

        keyword (str): Optional parameter to select components with keyword as a subset of the label. Default is None.

        input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
            in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

    Returns:
        ret (int): 0 in case of success, otherwise 1
    '''

    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        kwargs['is_video'] = True
        df, input_dir = load_dataframe(work_dir, "similarity", input_dir, work_dir, kwargs, ["from", "to", "distance"])
        df = remove_duplicate_video_distances(df, kwargs)
        if df is None:
            return 1

        ret = create_components_gallery(work_dir, save_path=save_path, num_images=num_images, lazy_load=lazy_load,
                                         get_label_func=get_label_func, group_by=group_by, slice=slice,
                                         max_width=max_width, max_items=max_items, get_bounding_box_func=get_bounding_box_func,
                                         get_reformat_filename_func=get_reformat_filename_func, get_extra_col_func=get_extra_col_func, threshold=threshold, metric=metric,
                                         descending=descending, min_items=min_items, keyword=keyword, comp_type="component",
                                         input_dir=input_dir, **kwargs)
        fastdup_performance_capture("create_component_video_gallery", start_time)
        return ret

    except Exception as ex:
        fastdup_capture_exception("create_component_videos_gallery", ex)

fastdup.create_kmeans_clusters_gallery

Function to visualize the kmeans clusters.

Parameters:

Name Type Description Default
work_dir str

path to fastdup work_dir

required
save_path str

output folder location for the visuals

required
num_images int

Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

20
lazy_load boolean

If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

False
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.

None
slice str or list

Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.

None
max_width int

Optional parameter to set the max html width of images in the gallery. Default is None.

None
max_items int

Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.

None
get_bounding_box_func callable

Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

None
get_reformat_filename_func callable

Optional parameter to allow changing the presented filename into another string. The input is an absolute path to the image and the output is the string to display instead of the filename.

None
get_extra_col_func callable

Optional parameter to allow adding more information to the report.

None
threshold float

Optional parameter to set the threshold for choosing components. Default is None.

None
metric str

Optional parameter to set the metric to use (like blur) for choosing components. Default is None.

None
descending boolean

Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.

True
min_items int

Optional parameter to select components with min_items or more items. Default is None.

None
keyword str

Optional parameter to select components with keyword as a subset of the label. Default is None.

None
input_dir str

Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

None

Returns:

Name Type Description
ret int

0 in case of success, otherwise 1
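
The get_label_func description above also allows a label file whose first row is index,label and whose rows follow the order of the image list. A minimal sketch of writing such a file with the standard csv module is shown below. The function name write_label_file and the sample labels are hypothetical, and using a 0-based row number for the index column is an assumption of this sketch.

```python
import csv
import os
import tempfile


def write_label_file(labels, path):
    """Write labels as CSV with an 'index,label' header, one row per image, in order."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["index", "label"])
        for index, label in enumerate(labels):
            # Assumption: the index column is the 0-based row number.
            writer.writerow([index, label])


labels = ["cat", "dog", "cat"]  # must follow the order of the image list file
label_path = os.path.join(tempfile.mkdtemp(), "labels.csv")
write_label_file(labels, label_path)

with open(label_path, newline="") as handle:
    rows = list(csv.reader(handle))
print(rows)
```

The path label_path could then be passed as get_label_func.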

Source code in fastdup/__init__.py
def create_kmeans_clusters_gallery(work_dir, save_path, num_images=20, lazy_load=False, get_label_func=None,
                            slice=None, max_width=None, max_items=None, get_bounding_box_func=None,
                              get_reformat_filename_func=None, get_extra_col_func=None, threshold=None, metric=None,
                              descending=True, min_items=None, keyword=None, input_dir=None, **kwargs):
    '''
    Function to visualize the kmeans clusters.

    Args:

        work_dir (str): path to fastdup work_dir

        save_path (str): output folder location for the visuals

        num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

        lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduced html file size).

        get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
            The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing a string label for each file. The first row should be index,label. The label file should have the same length and order as the atrain_features_data.csv image list file.

        slice (str or list): Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.

        max_width (int): Optional parameter to set the max html width of images in the gallery. Default is None.

        max_items (int): Optional parameter to limit the number of items displayed (labels for group_by='visual' or components for group_by='label'). Default is None.

        get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
            The input is an absolute path to the image and the output is a list of bounding boxes.
            Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
            Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
            Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

        get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.  The input is an absolute path to the image and the output is the string to display instead of the filename.

        get_extra_col_func (callable): Optional parameter to allow adding more information to the report.

        threshold (float): Optional parameter to set the threshold for choosing components. Default is None.

        metric (str): Optional parameter to set the metric to use (like blur) for choosing components. Default is None.

        descending (boolean): Optional parameter to set the order of the components. Default is True, namely list components from largest to smallest.

        min_items (int): Optional parameter to select components with min_items or more items. Default is None.

        keyword (str): Optional parameter to select components with keyword as a subset of the label. Default is None.

        input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
            in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

    Returns:
         ret (int): 0 in case of success, otherwise 1
    '''

    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        if isinstance(work_dir, str):
            config = load_config(os.path.dirname(work_dir))
            if input_dir is None and config is not None and 'input_dir' in config:
                input_dir = config['input_dir']

        ret = check_params(work_dir, num_images, lazy_load, get_label_func, slice, save_path, max_width)
        if ret != 0:
            return ret

        ret = do_create_components_gallery(work_dir, save_path, num_images, lazy_load, get_label_func,
                                            'visual', slice, max_width, max_items, min_items, get_bounding_box_func,
                                            get_reformat_filename_func, get_extra_col_func, threshold, metric=metric,
                                            descending=descending, keyword=keyword, comp_type="cluster",
                                            input_dir=input_dir, kwargs=kwargs)
        fastdup_performance_capture("create_kmeans_clusters_gallery", start_time)
        return ret

    except Exception as ex:
        fastdup_capture_exception("create_kmeans_clusters_gallery", ex)

fastdup.create_stats_gallery

Function to create and display a gallery of images computed by the statistics metrics.
Supported metrics are: mean (color), max (color), min (color), stdv (color), unique (number of unique colors), blurriness (computed by the variance of the Laplacian method,
see https://theailearner.com/2021/10/30/blur-detection-using-the-variance-of-the-laplacian-method/).
The metrics are created by fastdup.run() and stored in the work_dir in a file named atrain_stats.csv. Note that the metrics are computed
on the fly; fastdup loads and resizes every image only once.
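
The blur metric is described above as the variance of the Laplacian. As a rough illustration of that idea only, the sketch below computes it in pure Python on a small grayscale grid, using the 4-neighbour Laplacian over interior pixels. The function name laplacian_variance and the sample grids are hypothetical, and fastdup's own implementation (kernel, border handling, resizing) may differ.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior of a 2D grid.

    `gray` is a list of equal-length rows of pixel intensities (at least 3x3).
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


flat = [[128] * 5 for _ in range(5)]               # uniform image, no edges
edge = [[0, 0, 255, 255, 255] for _ in range(5)]   # sharp vertical edge

print(laplacian_variance(flat))  # 0.0
print(laplacian_variance(edge) > laplacian_variance(flat))  # True
```

A low value indicates few sharp intensity changes, which is why the metric is commonly used as a blur indicator.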

Parameters:

Name Type Description Default
stats_file str

csv file with the computed image statistics by the fastdup tool, alternatively a pandas dataframe. Default stats file is saved by fastdup.run() into the folder work_dir as atrain_stats.csv.

required
save_path str

output folder location for the visuals

required
num_images int

Max number of images to display (default = 50). Be careful not to display too many images at once otherwise the notebook may go out of memory.

20
lazy_load boolean

If False, write all images inside html file using base64 encoding. Otherwise use lazy loading in the html to load images when mouse curser is above the image (reduced html file size).

False
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.

None
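The three accepted forms of get_label_func can be sketched as follows. The folder layout (label taken from the parent folder name), the example paths, and the helper names `label_from_parent_folder` and `write_label_file` are assumptions for illustration, not part of fastdup.

```python
import csv
import io
import os


def label_from_parent_folder(path):
    """Callable form: use the parent folder name as the class label."""
    return os.path.basename(os.path.dirname(path))


# Dictionary form: absolute file name -> label (or list of labels).
label_dict = {
    "/data/train/cat/001.jpg": "cat",
    "/data/train/dog/002.jpg": ["dog", "outdoor"],
}


def write_label_file(labels, handle):
    """File form: header `index,label`, then one row per image in image-list order."""
    writer = csv.writer(handle)
    writer.writerow(["index", "label"])
    for i, label in enumerate(labels):
        writer.writerow([i, label])


# Written to an in-memory buffer here; a real label file would be written to disk.
buffer = io.StringIO()
write_label_file(["cat", "dog"], buffer)
```

Any one of `label_from_parent_folder`, `label_dict`, or the path of a file written this way could then be passed as get_label_func.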
metric str

Optional metric selection. Supported metrics are:
* width - of original image before resize
* height - of original image before resize
* size - area
* file_size - file size in bytes
* blur - variance of the laplacian
* unique - number of unique colors, 0..255
* mean - mean color 0..255
* max - max color 0..255
* min - min color 0..255
Advanced metrics (to compute them, run with turi_param='run_advanced_stats=1'):
* contrast
* rms_contrast - square root of mean sum of stdv/mean per channel
* mean_rel_intensity_r
* mean_rel_intensity_b
* mean_rel_intensity_g
* mean_hue - transform to HSV and compute mean H
* mean_saturation - transform to HSV and compute mean S
* mean_val - transform to HSV and compute mean V
* edge_density - using canny filter
* mean_r - mean of R channel
* mean_g - mean of G channel
* mean_b - mean of B channel

'blur'
slice str

Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.

None
max_width int

Optional parameter to select the maximal image width in the report

None
descending bool

Optional parameter to control the sort order of the metric: True sorts in descending order, False in ascending order.

False
get_bounding_box_func callable

Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

None
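The callable form expects corner coordinates (x1, y1, x2, y2), while the csv form stores a top-left corner plus a size. The sketch below converts between the two, assuming that x2 = col_x + width and y2 = row_y + height; how fastdup itself interprets the csv is not stated here, and the helper names and sample rows are hypothetical.

```python
import csv
import io


def xywh_to_xyxy(col_x, row_y, width, height):
    """Convert a top-left corner plus size into x1, y1, x2, y2 corners."""
    return [col_x, row_y, col_x + width, row_y + height]


def load_boxes(handle):
    """Read index,filename,col_x,row_y,width,height rows into {filename: [box, ...]}."""
    boxes = {}
    for row in csv.DictReader(handle):
        box = xywh_to_xyxy(int(row["col_x"]), int(row["row_y"]),
                           int(row["width"]), int(row["height"]))
        boxes.setdefault(row["filename"], []).append(box)
    return boxes


# Hypothetical csv content with two boxes on the same image.
sample = io.StringIO(
    "index,filename,col_x,row_y,width,height\n"
    "0,/data/a.jpg,0,0,100,100\n"
    "1,/data/a.jpg,10,20,30,40\n"
)
boxes_by_file = load_boxes(sample)


def get_bounding_box_func(path):
    """Callable form: return the list of boxes for an absolute image path."""
    return boxes_by_file.get(path, [])
```

The dictionary `boxes_by_file` is itself an example of the dictionary form described above.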
get_reformat_filename_func callable

Optional parameter to allow changing the presented filename into another string.
The input is an absolute path to the image and the output is the string to display instead of the filename.

None
get_extra_col_func callable

Optional parameter to allow adding extra columns to the gallery.

None
input_dir str

Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

None
work_dir str

Optional parameter to fastdup work_dir. Needed when stats file is a pd.DataFrame.

None

Returns:

Name Type Description
ret int

0 in case of success, otherwise 1.
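A possible way to call the function and act on its return code is sketched below. The fastdup module is passed in as an argument so the wrapper can be exercised without the library installed; `build_blur_gallery` and `short_name` are hypothetical helper names, and the stats file location follows the default described above.

```python
import os


def short_name(path):
    """Show only the last two path components instead of the full path."""
    parts = os.path.normpath(path).split(os.sep)
    return "/".join(parts[-2:])


def build_blur_gallery(fastdup_module, work_dir, save_path):
    """Call create_stats_gallery with documented parameters; return True on success."""
    ret = fastdup_module.create_stats_gallery(
        os.path.join(work_dir, "atrain_stats.csv"),  # stats file written by fastdup.run()
        save_path,
        num_images=20,
        metric="blur",
        descending=False,
        get_reformat_filename_func=short_name,
    )
    # The documented contract is 0 on success and 1 otherwise.
    return ret == 0
```

In a real session this would be invoked as `build_blur_gallery(fastdup, work_dir, save_path)` after `import fastdup` and a completed fastdup.run().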

Source code in fastdup/__init__.py
def create_stats_gallery(stats_file, save_path, num_images=20, lazy_load=False, get_label_func=None,
                         metric='blur', slice=None, max_width=None, descending=False, get_bounding_box_func=None,
                         get_reformat_filename_func=None, get_extra_col_func=None, input_dir=None, work_dir=None, **kwargs):
    '''
    Function to create and display a gallery of images computed by the statistics metrics.
    Supported metrics are: mean (color), max (color), min (color), stdv (color), unique (number of unique colors), and blurriness (computed by the variance of the Laplacian method,
    see https://theailearner.com/2021/10/30/blur-detection-using-the-variance-of-the-laplacian-method/).
    The metrics are created by fastdup.run() and stored in the `work_dir` in a file named `atrain_stats.csv`. Note that the metrics are computed
    on the fly: fastdup loads and resizes every image only once.

    Args:
        stats_file (str): csv file with the computed image statistics by the fastdup tool, alternatively a pandas dataframe. Default stats file is saved by fastdup.run() into the folder `work_dir` as `atrain_stats.csv`.

        save_path (str): output folder location for the visuals

        num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

        lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).

        get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
            Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.

        metric (str): Optional metric selection. Supported metrics are:
            * width - of original image before resize
            * height - of original image before resize
            * size - area
            * file_size - file size in bytes
            * blur - variance of the laplacian
            * unique - number of unique colors, 0..255
            * mean - mean color 0..255
            * max - max color 0..255
            * min - min color 0..255
            Advanced metrics (to compute them, run with turi_param='run_advanced_stats=1'):
            * contrast
            * rms_contrast - square root of mean sum of stdv/mean per channel
            * mean_rel_intensity_r
            * mean_rel_intensity_b
            * mean_rel_intensity_g
            * mean_hue - transform to HSV and compute mean H
            * mean_saturation - transform to HSV and compute mean S
            * mean_val - transform to HSV and compute mean V
            * edge_density - using canny filter
            * mean_r - mean of R channel
            * mean_g - mean of G channel
            * mean_b - mean of B channel


        slice (str): Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.

        max_width (int): Optional parameter to select the maximal image width in the report

        descending (bool): Optional parameter to control the order of the metric

        get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
            The input is an absolute path to the image and the output is a list of bounding boxes.
            Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
            Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
            Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

        get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.
            The input is an absolute path to the image and the output is the string to display instead of the filename.

        get_extra_col_func (callable): Optional parameter to allow adding extra columns to the gallery.

        input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
            in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

        work_dir (str): Optional parameter to fastdup work_dir. Needed when stats file is a pd.DataFrame.


    Returns:
        ret (int): 0 in case of success, otherwise 1.
    '''

    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        ret = check_params(stats_file, num_images, lazy_load, get_label_func, slice, save_path, max_width)
        if ret != 0:
            return ret

        assert metric in ['blur','size','mean','min','max','unique','stdv', 'file_size','rms_contrast','mean_rel_intensity_r',
                          'mean_rel_intensity_b','mean_rel_intensity_g','contrast','mean_saturation','mean_hue', 'mean_val', 'edge_density','mean_r', 'mean_g','mean_b'], "Unknown metric value: " + metric

        stats_file = load_stats(stats_file, work_dir, kwargs)
        try:
            import matplotlib
        except Exception as ex:
            print("Failed to import matplotlib. Please install matplotlib using 'python3.8 -m pip install matplotlib'")
            fastdup_capture_exception("create_stats_gallery", ex)
            return 1


        ret = do_create_stats_gallery(stats_file, save_path, num_images, lazy_load, get_label_func, metric, slice, max_width,
                                       descending, get_bounding_box_func, get_reformat_filename_func, get_extra_col_func, input_dir, work_dir, kwargs=kwargs)
        fastdup_performance_capture("create_stats_gallery", start_time)
        return ret

    except Exception as e:
        fastdup_capture_exception("create_stats_gallery", e)
        # Return 1 on failure, as documented, instead of an implicit None
        return 1

fastdup.create_similarity_gallery

Function to create and display a gallery of images computed by the similarity metric. In each table row one query image is
displayed and num_images most similar images are displayed next to it on the right.

In case the dataset is labeled, the user can specify the label using the function get_label_func. In this case a score metric is computed to reflect how similar the query image is to its most similar images in terms of class label.
Score 100 means that all of the top k (num_images) similar images are from the same class as the query image. Score 0 means that the image is similar only to images from different classes.
Score 50 means that the query image is similar to the same number of images from the same class and from other classes. The report is sorted by the score metric.
For a high-quality labeled dataset we expect the score to be high; a low score may indicate class label issues.
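One simple reading of the score that matches the three reference values above (100, 50, 0) is the percentage of neighbors that share the query label. The sketch below follows that reading; fastdup's exact formula is not given here, so `label_score` is a hypothetical stand-in.

```python
def label_score(query_label, neighbor_labels):
    """Percentage (0..100) of neighbors that share the query image's label."""
    if not neighbor_labels:
        return None  # no neighbors, so no score can be computed
    same = sum(1 for label in neighbor_labels if label == query_label)
    return 100.0 * same / len(neighbor_labels)
```

Under this reading, an image labeled "cat" whose nearest neighbors are mostly labeled "dog" gets a low score, which is the pattern worth inspecting for labeling mistakes.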

Parameters:

Name Type Description Default
similarity_file str

csv file with the image similarities computed by the fastdup tool, or a path to the work_dir,
alternatively a pandas dataframe. In case of a pandas dataframe, work_dir must be set to point to the fastdup work_dir.

required
save_path str

output folder location for the visuals

required
num_images int

Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

20
lazy_load boolean

If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).

False
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.

None
slice str

Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
A special value is 'label_score' which is used for comparing both images and labels of the nearest neighbors. The score values are 0->100 where 0 means the query image is only similar to images outside its class, 100 means the query image is only similar to images from the same class.

None
max_width int

Optional param to limit the image width

None
descending bool

Optional param to control the order of the metric

False
get_bounding_box_func callable

Optional parameter to allow plotting bounding boxes on top of the image.
The input is an absolute path to the image and the output is a list of bounding boxes.
Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

None
get_reformat_filename_func callable

Optional parameter to allow changing the presented filename into another string.

None
get_extra_col_func callable

Optional parameter to allow adding extra columns to the report

None
input_dir str

Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

None
work_dir str

Optional parameter to fastdup work_dir. Needed when similarity_file is a pd.DataFrame.

None
min_items int

Optional parameter to select components with min_items or more

2
max_items int

Optional parameter to limit the number of items displayed

None

Returns:

Name Type Description
ret pd.DataFrame

Similarity dataframe: for each image filename, the top K similar images.
Each row has the columns 'from', 'to', 'label' (optional), 'distance'. If an exception is raised internally, None is returned instead.
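The 'from', 'to', 'distance' rows can be grouped into one neighbor list per query image. The sketch below uses plain tuples rather than a dataframe and assumes that a larger 'distance' value means a closer match (consistent with the 0 to 1 similarity threshold used by the classifiers); `top_k_neighbors` is a hypothetical helper.

```python
from collections import defaultdict


def top_k_neighbors(rows, k):
    """Group (from, to, distance) rows by query image and keep the k best matches."""
    grouped = defaultdict(list)
    for src, dst, distance in rows:
        grouped[src].append((dst, distance))
    return {
        # Assumption: higher distance value == more similar.
        src: sorted(matches, key=lambda m: m[1], reverse=True)[:k]
        for src, matches in grouped.items()
    }


rows = [
    ("a.jpg", "b.jpg", 0.90),
    ("a.jpg", "c.jpg", 0.95),
    ("a.jpg", "d.jpg", 0.50),
    ("b.jpg", "a.jpg", 0.90),
]
neighbors = top_k_neighbors(rows, 2)
```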

Source code in fastdup/__init__.py
def create_similarity_gallery(similarity_file, save_path, num_images=20, lazy_load=False, get_label_func=None,
                                 slice=None, max_width=None, descending=False, get_bounding_box_func=None,
                                 get_reformat_filename_func=None, get_extra_col_func=None, input_dir=None, work_dir=None,
                                 min_items=2, max_items=None, **kwargs):
    '''

    Function to create and display a gallery of images computed by the similarity metric. In each table row one query image is
    displayed and `num_images` most similar images are displayed next to it on the right.

    In case the dataset is labeled, the user can specify the label using the function `get_label_func`. In this case a `score` metric is computed to reflect how similar the query image is to its most similar images in terms of class label.
    Score 100 means that all of the top k (num_images) similar images are from the same class as the query image. Score 0 means that the image is similar only to images from different classes.
    Score 50 means that the query image is similar to the same number of images from the same class and from other classes. The report is sorted by the score metric.
    For a high-quality labeled dataset we expect the score to be high; a low score may indicate class label issues.

    Args:
        similarity_file (str): csv file with the image similarities computed by the fastdup tool, or a path to the work_dir,
            alternatively a pandas dataframe. In case of a pandas dataframe, work_dir must be set to point to the fastdup work_dir.

        save_path (str): output folder location for the visuals

        num_images (int): Max number of images to display (default = 20). Be careful not to display too many images at once, otherwise the notebook may run out of memory.

        lazy_load (boolean): If False, write all images inside the html file using base64 encoding. Otherwise use lazy loading in the html to load images when the mouse cursor is above the image (reduces html file size).

        get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
            Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.

        slice (str): Optional parameter to select a slice of the outliers file based on a specific label or a list of labels.
            A special value is 'label_score' which is used for comparing both images and labels of the nearest neighbors. The score values are 0->100 where 0 means the query image is only similar to images outside its class, 100 means the query image is only similar to images from the same class.

        max_width (int): Optional param to limit the image width

        descending (bool): Optional param to control the order of the metric

        get_bounding_box_func (callable): Optional parameter to allow plotting bounding boxes on top of the image.
            The input is an absolute path to the image and the output is a list of bounding boxes.
            Each bounding box should be 4 integers: x1, y1, x2, y2. Example of valid bounding box list: [[0, 0, 100, 100]]
            Alternatively, get_bounding_box_func could be a dictionary returning the bounding box list for each filename.
            Alternatively, get_bounding_box_func could be a csv containing index,filename,col_x,row_y,width,height or a work_dir where the file atrain_crops.csv exists

        get_reformat_filename_func (callable): Optional parameter to allow changing the presented filename into another string.

        get_extra_col_func (callable): Optional parameter to allow adding extra columns to the report

        input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
            in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

        work_dir (str): Optional parameter to fastdup work_dir. Needed when similarity_file is a pd.DataFrame.

        min_items (int): Optional parameter to select components with min_items or more

        max_items (int): Optional parameter to limit the number of items displayed

    Returns:
        ret (pd.DataFrame): similarity dataframe: for each image filename, the top K similar images.
            Each row has the columns 'from', 'to', 'label' (optional), 'distance'. None is returned if an exception is raised internally.
     '''

    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        ret = check_params(similarity_file, num_images, lazy_load, get_label_func, slice, save_path, max_width)
        if ret != 0:
            return ret

        similarity_file, input_dir = load_dataframe(similarity_file, "similarity", input_dir, work_dir, kwargs, ["from", "to", "distance"])

        ret = do_create_similarity_gallery(similarity_file, save_path, num_images, lazy_load, get_label_func,
            slice, max_width, descending, get_bounding_box_func, get_reformat_filename_func, get_extra_col_func, 
            input_dir,  work_dir, min_items, max_items, kwargs=kwargs)
        fastdup_performance_capture("create_similarity_gallery", start_time)
        return ret

    except Exception as e:
        fastdup_capture_exception("create_similarity_gallery", e)
        return None

fastdup.create_aspect_ratio_gallery

Function to create and display a gallery of aspect ratio distribution.

Parameters:

Name Type Description Default
stats_file str

csv file with the computed image statistics by the fastdup tool, or work_dir path or a pandas dataframe with the stats computed by fastdup.

required
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.

None
max_width int

Optional parameter to limit the plot image width

None
save_path str

output folder location for the visuals

required
num_images int

Optional number of images to compute the statistics on (default computes on all images)

0
slice str

Optional parameter to slice the stats file based on a specific label or a list of labels.

None
get_filename_reformat_func callable

Optional function to reformat the filename before displaying it.

None
input_dir str

Optional parameter to specify the input directory of webdataset tar files,
in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

None

Returns:

Name Type Description
ret int

0 in case of success, otherwise 1.
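As a rough illustration of what an aspect ratio distribution summarizes, the sketch below bins width / height ratios from a list of image sizes. It does not reproduce fastdup's plots; the bin width, the helper name `aspect_ratio_histogram`, and the sample sizes are assumptions.

```python
from collections import Counter


def aspect_ratio_histogram(sizes, bin_width=0.25):
    """Count images per aspect-ratio bin, where ratio = width / height."""
    counts = Counter()
    for width, height in sizes:
        if height <= 0:
            continue  # skip unreadable or degenerate entries
        ratio = width / height
        # Snap the ratio down to the start of its bin, e.g. 1.33 -> 1.25.
        bin_start = int(ratio / bin_width) * bin_width
        counts[round(bin_start, 2)] += 1
    return dict(sorted(counts.items()))


sizes = [(640, 480), (800, 600), (1920, 1080), (500, 500)]
histogram = aspect_ratio_histogram(sizes)
```

The width and height values would come from the corresponding columns of the stats file, which hold the original image dimensions before resize.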

Source code in fastdup/__init__.py
def create_aspect_ratio_gallery(stats_file, save_path, get_label_func=None, max_width=None, num_images=0, slice=None,
                                get_filename_reformat_func=None, input_dir=None, **kwargs):
    '''
    Function to create and display a gallery of aspect ratio distribution.

    Args:
        stats_file (str): csv file with the computed image statistics by the fastdup tool, or work_dir path or a pandas dataframe with the stats computed by fastdup.

        get_label_func (callable): Optional function that, given an absolute path to an image, returns the image label.
            Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.

        max_width (int): optional parameter to limit the plot image width

        save_path (str): output folder location for the visuals

        num_images (int): optional number of images to compute the statistics on (default computes on all images)

        slice (str): optional parameter to slice the stats file based on a specific label or a list of labels.

        get_filename_reformat_func (callable): optional function to reformat the filename before displaying it.

        input_dir (str): Optional parameter to specify the input directory of webdataset tar files,
            in case when working with webdataset tar files where the image was deleted after run using turi_param='delete_img=1'

    Returns:
        ret (int): 0 in case of success, otherwise 1.
    '''
    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        ret = check_params(stats_file, 1, False, get_label_func, slice, save_path, max_width)
        if ret != 0:
            return ret

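        # load_stats is called elsewhere as load_stats(stats_file, work_dir, kwargs);
        # this function has no work_dir argument, so pass None in that position.
        stats_file = load_stats(stats_file, None, kwargs)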

        try:
            import matplotlib
        except Exception as e:
            fastdup_capture_exception("create_aspect_ratio_gallery", e)
            print("Failed to import matplotlib. Please install matplotlib using 'python3.8 -m pip install matplotlib'")
            return 1


        ret = do_create_aspect_ratio_gallery(stats_file, save_path, get_label_func, max_width, num_images, slice, input_dir, kwargs=kwargs)
        fastdup_performance_capture("create_aspect_ratio_gallery", start_time)
        return ret

    except Exception as e:
        fastdup_capture_exception("create_aspect_ratio_gallery", e)
        # Return 1 on failure, as documented, instead of an implicit None
        return 1

Fastdup classifiers

Given the fastdup output, compute a baseline lightweight classifier.
fastdup.create_knn_classifier

Function to create a knn classifier out of fastdup run. We assume there are existing labels to the datapoints.

Parameters:

Name Type Description Default
work_dir str

fastdup work_dir, or location of a similarity file, or a pandas DataFrame with the computed similarities

required
k int

Number of nearest neighbors considered for each prediction (passed to top_k_label).

required
get_label_func callable

Function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.

required
threshold float

Optional threshold in the range 0 to 1; only neighbors with similarity larger than threshold are considered.

None

Returns:

Name Type Description
df pd.DataFrame

Dataframe with the columns filename, prediction and label: one knn prediction per image to one of the given classes.
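The prediction step can be pictured as a majority vote over the labels of the most similar neighbors. The helper below is a hypothetical stand-in, since the body of fastdup's own top_k_label is not shown in this page; it assumes that higher similarity values mean closer neighbors and that the threshold comparison is strict.

```python
from collections import Counter


def vote_label(neighbor_labels, similarities, k, threshold=None):
    """Majority vote over the k most similar neighbors, optionally above a threshold."""
    # Rank neighbors from most to least similar.
    ranked = sorted(zip(similarities, neighbor_labels), key=lambda p: p[0], reverse=True)
    if threshold is not None:
        ranked = [pair for pair in ranked if pair[0] > threshold]
    top = [label for _, label in ranked[:k]]
    if not top:
        return None  # no neighbor passed the threshold
    return Counter(top).most_common(1)[0][0]
```

For example, with neighbors labeled cat, dog, cat and k=3 the vote is "cat", while a strict threshold can leave a single neighbor or none at all.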

Source code in fastdup/__init__.py
def create_knn_classifier(work_dir, k, get_label_func, threshold=None):
    '''
    Function to create a knn classifier out of fastdup run. We assume there are existing labels to the datapoints.

    Args:
        work_dir (str): fastdup work_dir, or location of a similarity file, or a pandas DataFrame with the computed similarities
        k (int): number of nearest neighbors considered for each prediction (passed to top_k_label)
        get_label_func (callable): function that, given an absolute path to an image, returns the image label.
            Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.
        threshold (float): optional threshold in the range 0 to 1; only neighbors with similarity larger than threshold are considered.

    Returns:
        df (pd.DataFrame): dataframe with the columns filename, prediction and label; one knn prediction per image to one of the given classes
    '''
    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        from fastdup.confusion_matrix import classification_report

        # work_dir may also be a DataFrame, which os.path.exists cannot handle
        assert isinstance(work_dir, pd.DataFrame) or os.path.exists(work_dir), "Failed to find work directory " + str(work_dir)
        assert callable(get_label_func) or isinstance(get_label_func, dict) or (isinstance(get_label_func, str) and os.path.exists(get_label_func)), \
            "Please provide a valid callable function get_label_func, given a filename returns its string label or a list of labels, " \
            "or a dictionary where the key is the absolute file name and the value is the label or list of labels or a labels file with header index,label where " \
            "each row is a label corresponding to the image in the atrain_features_data.csv file"

        if threshold is not None:
            assert threshold >= 0 and threshold <= 1, "Please provide a valid threshold 0->1"

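        if isinstance(work_dir, pd.DataFrame):
            df = work_dir
            assert len(df), "Empty dataframe received"
        else:
            # work_dir is either the fastdup work_dir or a direct path to a similarity file
            if os.path.isdir(work_dir):
                similarity_file = os.path.join(work_dir, FILENAME_SIMILARITY)
            else:
                similarity_file = work_dir
            df = pd.read_csv(similarity_file)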

        labels_dict = None
        if callable(get_label_func):
            df['to_label'] = df['to'].apply(get_label_func)
        elif isinstance(get_label_func, dict):
            df['to_label'] = df['to'].apply(lambda x: get_label_func.get(x, MISSING_LABEL))
        elif isinstance(get_label_func, str):
            labels_df = pd.read_csv(get_label_func)
            filenames_df = pd.read_csv(os.path.join(work_dir, FILENAME_IMAGE_LIST))
            if len(labels_df) != len(filenames_df):
                print('Error: labels file length does not match the number of images in the image list file', get_label_func, len(labels_df), len(filenames_df))
                return None
            if 'label' not in labels_df.columns:
                print('Error: labels file does not contain a label column', get_label_func)
                return None
            filenames_df['label'] = labels_df['label']
            labels_dict = pd.Series(filenames_df.label.values,index=filenames_df.filename).to_dict()
            df['to_label'] = df['to'].apply(lambda x: labels_dict.get(x, MISSING_LABEL))

        from_list = df.groupby(by='from', axis=0)['to'].apply(list)
        distance_list = df.groupby(by='from', axis=0)['distance'].apply(list)
        to_label_list = df.groupby(by='from', axis=0)['to_label'].apply(list)

        df_from = from_list.to_frame()
        df_dist = distance_list.to_frame()
        df_label = to_label_list.to_frame()

        df_merge = df_from.merge(df_dist, on='from')
        df_merge = df_merge.merge(df_label, on='from')

        if callable(get_label_func):
            df_merge['from_label'] = df_merge.index.map(get_label_func)
        elif isinstance(get_label_func, dict):
            df_merge['from_label'] = df_merge.index.map(lambda x: get_label_func.get(x, MISSING_LABEL))
        elif isinstance(get_label_func, str):
            assert labels_dict is not None
            df_merge['from_label'] = df_merge.index.map(lambda x: labels_dict.get(x, MISSING_LABEL))


        df_merge['top_k'] = df_merge.apply(lambda x:
                                           top_k_label(x['to_label'], x['distance'], k, threshold), axis=1)

        y_values = df_merge['from_label'].tolist()
        p1_values = df_merge['top_k'].tolist()
        filenames = df_merge.index.tolist()
        print(classification_report(y_values, p1_values))
        fastdup_performance_capture("create_knn_classifier", start_time)
        return pd.DataFrame({'filename':filenames, 'prediction':p1_values, 'label':y_values})
    except Exception as ex:
        fastdup_capture_exception("create_knn_classifier", ex)
        return pd.DataFrame({'filename':[]})

fastdup.create_kmeans_classifier

Function to create a kmeans classifier out of fastdup run. We assume there are existing labels to the datapoints.

Parameters:

Name Type Description Default
work_dir str

fastdup work_dir, or location of a similarity file, or a pandas DataFrame with the computed similarities

required
k int

(unused)

required
get_label_func callable

Function that, given an absolute path to an image, returns the image label.
Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be a filename containing a string label for each file. First row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.

required
threshold float

(unused)

None

Returns:

Name Type Description
df pd.DataFrame

dataframe with filename, label and predicted label. Row per each image
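
The cluster-level prediction can be illustrated with a small sketch. It assumes a simple majority vote per cluster; the actual top_k_label helper is not shown on this page, so majority_label and the clusters frame below are hypothetical stand-ins rather than fastdup's implementation.

```python
from collections import Counter

import pandas as pd


def majority_label(labels):
    """Return the most frequent label in a cluster (ties go to the first seen)."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical clusters: each row holds the files of one cluster and their labels.
clusters = pd.DataFrame({
    "files": [["a.jpg", "b.jpg", "c.jpg"], ["d.jpg", "e.jpg"]],
    "label": [["cat", "cat", "dog"], ["dog", "dog"]],
})

rows = []
for _, cluster in clusters.iterrows():
    predicted = majority_label(cluster["label"])
    # Every file in the cluster receives the cluster-level prediction.
    for filename, label in zip(cluster["files"], cluster["label"]):
        rows.append({"prediction": predicted, "label": label, "filename": filename})

result = pd.DataFrame(rows)
print(result)
```

The resulting frame has one row per image with the same three columns that the function returns: prediction, label and filename.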

Source code in fastdup/__init__.py
def create_kmeans_classifier(work_dir, k, get_label_func, threshold=None):
    '''
    Function to create a kmeans classifier out of a fastdup run. We assume the datapoints have existing labels.

    Args:
        work_dir (str): fastdup work_dir, or location of a similarity file, or a pandas DataFrame with the computed similarities
        k (int): (unused)
        get_label_func (callable): optional function given an absolute path to an image return the image label.
            Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.
        threshold (float): (unused)

    Returns:
        df (pd.DataFrame): dataframe with filename, label and predicted label. Row per each image
    '''
    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        from fastdup.confusion_matrix import classification_report

        assert callable(get_label_func) or isinstance(get_label_func, dict) or (isinstance(get_label_func, str) and os.path.exists(get_label_func)), \
            "Please provide a valid callable function get_label_func, given a filename returns its string label or a list of labels, " \
            "or a dictionary where the key is the absolute file name and the value is the label or list of labels or a labels file with header index,label where" \
            "each row is a label corresponding to the image in the atrain_features_data.csv file"

        comps = find_top_components(work_dir, get_label_func, 'visual', slice=None, comp_type='cluster')
        print(comps.columns)
        comps['top_k'] = comps.apply(lambda x:
                                           top_k_label(x['label'], x['distance'], k, threshold=threshold), axis=1)
        files = []
        y_values = []
        p1_values = []
        for i,row in comps.iterrows():
            cluster_label = row['top_k']
            for f,l in zip(row['files'], row['label']):
                files.append(f)
                y_values.append(l)
                p1_values.append(cluster_label)

        print(classification_report(y_values, p1_values))
        fastdup_performance_capture("create_kmeans_classifier", start_time)
        return pd.DataFrame({'prediction':p1_values, 'label':y_values, 'filename':files})

    except Exception as ex:
        fastdup_capture_exception("create_kmeans_classifier", ex)
        return pd.DataFrame({'filename':[]})

Fastdup utilities

The binary features produced by a fastdup run can be loaded with fastdup.load_binary_feature.

fastdup.load_binary_feature

Python function for loading the stored binary features written by fastdup and their matching filenames and analyzing them in Python.

Parameters:

Name Type Description Default
filename str

The binary feature file location

required
d int

Feature vector length

576

Returns:

Name Type Description
filenames list

A list of all image file names, of length X.

np_array np.array

A numpy matrix of shape rows x d columns (default d is 576). Each row corresponds to the feature vector of a single image.

Example

import fastdup
file_list, mat_features = fastdup.load_binary_feature(FILENAME_FEATURES)
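
The on-disk layout read by this function can be reproduced without fastdup. The sketch below is based on the source listing for this function: a raw little-endian float32 file plus a companion .csv with a filename column. The paths and the feature length d = 4 are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

d = 4  # illustrative feature length; the documented default is 576
names = ["/data/a.jpg", "/data/b.jpg", "/data/c.jpg"]  # hypothetical paths
features = np.arange(len(names) * d, dtype="float32").reshape(len(names), d)

with tempfile.TemporaryDirectory() as tmp:
    feature_file = os.path.join(tmp, "atrain_features.dat")
    # Raw little-endian float32 values, row after row, with no header.
    features.astype("<f4").tofile(feature_file)
    # Companion CSV with a single 'filename' column, in the same row order.
    pd.DataFrame({"filename": names}).to_csv(feature_file + ".csv", index=False)

    # Reading mirrors the listing: flat float32 buffer reshaped to (num_images, d).
    loaded_names = list(pd.read_csv(feature_file + ".csv")["filename"].values)
    flat = np.fromfile(feature_file, dtype="<f4")
    loaded = flat.reshape(len(loaded_names), d)

print(loaded.shape)
```

Because the binary file carries no header, the reshape only succeeds when d matches the length used at write time.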

Source code in fastdup/__init__.py
def load_binary_feature(filename, d=576):
    '''
    Python function for loading the stored binary features written by fastdup and their matching filenames and analyzing them in Python.

    Args:
        filename (str): The binary feature file location
        d (int): Feature vector length

    Returns:
        filenames (list): A list of all image file names, of length X.
        np_array (np.array): A numpy matrix of shape rows x d columns (default d is 576). Each row corresponds to the feature vector of a single image.

    Example:
        >>> import fastdup
        >>> file_list, mat_features = fastdup.load_binary_feature(FILENAME_FEATURES)

    '''

    if not os.path.exists(filename) or not os.path.exists(filename + '.csv'):
        print("Error: failed to find the binary feature file:", filename, ' and the filenames csv file:', filename + '.csv')
        return None
    assert(d > 0), "Feature vector length d has to be larger than zero"

    with open(filename, 'rb') as f:
        data = np.fromfile(f, dtype='<f')

    df = pd.read_csv(filename + '.csv')['filename'].values
    assert df is not None, "Failed to read input file " + filename
    num_images = len(df)
    print('Read a total of ', num_images, 'images')

    data = np.reshape(data, (num_images, d))
    assert data.shape[1] == d
    return list(df), data

fastdup.save_binary_feature

Function for saving data to be used by fastdup. Given a list of images and their matching feature vectors in a numpy array,
the function saves the data in a format readable by fastdup. This skips the feature extraction step; the saved data is to be used with run_mode=1, namely to run the
nearest neighbor model on the feature vectors.

Parameters:

Name Type Description Default
save_path str

Working folder to save the files to

required
filenames list

A list of file location of the images (absolute paths) of length n images

required
np_array np.array

Numpy array of size n x d. Each row is a feature vector of one file.

required

Returns:

Name Type Description
ret int

0 in case of success, otherwise 1
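
The assertions in the source listing require a non-empty list of string filenames and a float32 array with one row per file. A minimal sketch of preparing such inputs follows; prepare_features is a hypothetical helper written for this example, not part of fastdup.

```python
import numpy as np


def prepare_features(filenames, array):
    """Hypothetical helper: coerce inputs to what save_binary_feature asserts on."""
    if not isinstance(filenames, list) or not filenames:
        raise ValueError("filenames should be a non-empty list")
    if not all(isinstance(name, str) for name in filenames):
        raise ValueError("filenames should contain strings")
    # The dtype assertion requires float32; this copies only when a conversion is needed.
    array = np.ascontiguousarray(array, dtype="float32")
    if array.ndim != 2 or array.shape[0] != len(filenames):
        raise ValueError("expected one feature row per filename")
    return array


embeddings = np.random.rand(2, 8)  # float64 by default, so it would fail the dtype check
prepared = prepare_features(["/data/a.jpg", "/data/b.jpg"], embeddings)
print(prepared.dtype, prepared.shape, len(prepared.tobytes()))
```

The prepared array serializes to n x d x 4 bytes, which is the size a reader expects when reshaping the file back to n rows of length d.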

Source code in fastdup/__init__.py
def save_binary_feature(save_path, filenames, np_array):
    '''
    Function for saving data to be used by fastdup. Given a list of images and their matching feature vectors in a numpy array,
    function saves data in a format readable by fastdup. This saves the image extraction step, to be used with run_mode=1 namely perform
    nearest neighbor model on the feature vectors.

    Args:
        save_path (str): Working folder to save the files to
        filenames (list): A list of file location of the images (absolute paths) of length n images
        np_array (np.array): Numpy array of size n x d. Each row is a feature vector of one file.

    Returns:
        ret (int): 0 in case of success, otherwise 1

    '''
    fastdup_capture_log_debug_state(locals())

    assert isinstance(save_path, str)  and save_path.strip() != "", "Save path should be a non empty string"
    assert isinstance(filenames, list), "filenames should be a list of image files"
    assert filenames is not None and len(filenames), "filenames should be a non empty list"
    assert isinstance(filenames[0], str), 'filenames should contain strings with the image absolute paths'
    assert isinstance(np_array, np.ndarray),  "np_array should be a numpy array"
    assert np_array.dtype == 'float32', "np_array dtype must be float32. You can generate the array using the " \
                              "command: features = np.zeros((rows, cols), dtype='float32')"
    assert np_array.shape[0] == len(filenames), "np_array should contain rows matching to the filenames list"


    try:
        if not os.path.exists(save_path):
            os.makedirs(save_path)

        df = pd.DataFrame({'filename': filenames})
        local_filename = os.path.join(save_path, 'atrain_features.dat')
        df.to_csv(local_filename + '.csv', index=False)
        bytes_array = np_array.tobytes()
        with open(local_filename, 'wb') as f:
            f.write(bytes_array)
        assert os.path.exists(local_filename), "Failed to save file " + local_filename

    except Exception as ex:
        print("Failed to save to " + save_path + " Exception: " + str(ex))
        return 1

    return 0

fastdup.generate_sprite_image

Generate a sprite image of images for tensorboard projector. A sprite image is a large image composed of grid of smaller images.

Parameters:

Name Type Description Default
img_list list

list of image filenames (full path)

required
sample_size int

how many images to plot

required
log_dir str

directory to save the sprite image

required
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be the name of a file containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.

None
h int

optional requested height of each subimage

0
w int

optional requested width of each subimage

0
alternative_filename str

optional parameter to save the resulting image to a different name

None
alternative_width int

optional parameter to control the number of images per row

None
max_width int

optional parameter to control the resulting width of the image

None

Returns:

Name Type Description
path str

path to sprite image

labels list

list of labels
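
As an illustration of what a sprite image is, the sketch below tiles equally sized thumbnails into a near-square grid using numpy only. The grid rule (the square root of the count, rounded up, images per row) and the make_sprite name are assumptions for this example; fastdup's own layout logic lives in fastdup.tensorboard_projector and is not shown here.

```python
import math

import numpy as np


def make_sprite(thumbnails):
    """Tile equally sized HxWxC thumbnails into a near-square grid, row by row."""
    count = len(thumbnails)
    height, width, channels = thumbnails[0].shape
    per_row = math.ceil(math.sqrt(count))  # images per row
    rows = math.ceil(count / per_row)
    sprite = np.zeros((rows * height, per_row * width, channels),
                      dtype=thumbnails[0].dtype)
    for index, thumb in enumerate(thumbnails):
        r, c = divmod(index, per_row)
        # Copy the thumbnail into its grid cell; unused cells stay black.
        sprite[r * height:(r + 1) * height, c * width:(c + 1) * width] = thumb
    return sprite


# Five synthetic 8x8 RGB thumbnails, each filled with a distinct gray level.
thumbs = [np.full((8, 8, 3), 40 * i, dtype=np.uint8) for i in range(5)]
sprite = make_sprite(thumbs)
print(sprite.shape)
```

With five thumbnails the grid has three images per row and two rows, and the last cell is left empty.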

Source code in fastdup/__init__.py
def generate_sprite_image(img_list, sample_size, log_dir, get_label_func=None, h=0, w=0, alternative_filename=None, alternative_width = None, max_width=None):
    '''
    Generate a sprite image of images for tensorboard projector. A sprite image is a large image composed of grid of smaller images.

    Parameters:
        img_list (list): list of image filenames (full path)

        sample_size (int): how many images to plot

        log_dir (str): directory to save the sprite image

        get_label_func (callable): optional function given an absolute path to an image return the image label.
            Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.

        h (int): optional requested height of each subimage

        w (int): optional requested width of each subimage

        alternative_filename (str): optional parameter to save the resulting image to a different name

        alternative_width (int): optional parameter to control the number of images per row

        max_width (int): optional parameter to control the resulting width of the image

    Returns:
        path (str): path to sprite image
        labels (list): list of labels

    '''
    try:
        assert len(img_list), "Image list is empty"
        assert sample_size > 0
        from fastdup.tensorboard_projector import generate_sprite_image as tgenerate_sprite_image
        ret = tgenerate_sprite_image(img_list, sample_size, log_dir, get_label_func, h=h, w=w,
                                      alternative_filename=alternative_filename, alternative_width=alternative_width, max_width=max_width)
        return ret
    except Exception as ex:
        fastdup_capture_exception("generate_sprite_image", ex)

fastdup.export_to_tensorboard_projector

Export feature vector embeddings to be visualized using tensorboard projector app.

Example

import fastdup
fastdup.run('/my/data/', work_dir='out')
fastdup.export_to_tensorboard_projector(work_dir='out', log_dir='logs')

After the data is exported, run tensorboard projector:

%load_ext tensorboard
%tensorboard --logdir=logs

Parameters:

Name Type Description Default
work_dir str

work_dir where fastdup results are stored

required
log_dir str

output dir where tensorboard will read from

required
sample_size int

how many images to view. Default is 900.

900
sample_method str

how to sample, currently 'random' is supported.

'random'
with_images bool

add images to the visualization (default True)

True
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be the name of a file containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.

None
d int

dimension of the embedding vector. Default is 576.

576
file_list list

Optional parameter to specify a list of files to be used for the visualization. If not specified, filenames are taken from the work_dir/atrain_features.dat.csv file
Note: the order of file_list matters; it must match the order of the atrain_features.dat.csv file exactly.

None

Returns:

Name Type Description
ret int

0 in case of success, 1 in case of failure
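
Since only sample_size images are visualized, filenames and feature rows have to be sampled together. The sketch below shows one way to draw a random subset while keeping both aligned; sample_rows, the seed argument and the paths are assumptions for this example, and the actual sampling is done inside export_to_tensorboard_projector_inner.

```python
import numpy as np


def sample_rows(filenames, features, sample_size, seed=0):
    """Pick up to sample_size rows at random, keeping names and vectors aligned."""
    assert len(filenames) == features.shape[0], "one feature row per filename"
    count = min(sample_size, len(filenames))
    rng = np.random.default_rng(seed)
    # Sample indices without replacement, then sort to preserve the original order.
    picked = np.sort(rng.choice(len(filenames), size=count, replace=False))
    return [filenames[i] for i in picked], features[picked]


names = [f"/data/img_{i}.jpg" for i in range(10)]  # hypothetical paths
vectors = np.arange(10 * 4, dtype="float32").reshape(10, 4)
sampled_names, sampled_vectors = sample_rows(names, vectors, sample_size=3)
print(sampled_names, sampled_vectors.shape)
```

Indexing both the name list and the matrix with the same index array is what keeps each embedding attached to the right image.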

Source code in fastdup/__init__.py
def export_to_tensorboard_projector(work_dir, log_dir, sample_size = 900,
                                    sample_method='random', with_images=True, get_label_func=None, d=576, file_list=None):
    '''
    Export feature vector embeddings to be visualized using tensorboard projector app.

    Example:
        >>> import fastdup
        >>> fastdup.run('/my/data/', work_dir='out')
        >>> fastdup.export_to_tensorboard_projector(work_dir='out', log_dir='logs')

        After the data is exported, run tensorboard projector
        >>> %load_ext tensorboard
        >>> %tensorboard --logdir=logs

    Args:
        work_dir (str): work_dir where fastdup results are stored

        log_dir (str): output dir where tensorboard will read from

        sample_size (int): how many images to view. Default is 900.

        sample_method (str): how to sample, currently 'random' is supported.

        with_images (bool): add images to the visualization (default True)

        get_label_func (callable): optional function given an absolute path to an image return the image label.
            Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.

        d (int): dimension of the embedding vector. Default is 576.

        file_list (list): Optional parameter to specify a list of files to be used for the visualization. If not specified, filenames are taken from the work_dir/atrain_features.dat.csv file
                      Note: be careful here as the order of the file_list matters, need to keep the exact same order as the atrain_features.dat.csv file!
    Returns:
        ret (int): 0 in case of success, 1 in case of failure
    '''

    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        try:
            import tensorflow
            from tensorboard.plugins import projector
        except Exception as ex:
            print('To save information for tensorboard projector you need to install tensorflow. Please pip install tensorflow and tensorboard and try again')
            fastdup_capture_exception("tensorflow import", ex)
            return 1


        from fastdup.tensorboard_projector import export_to_tensorboard_projector_inner
        if not os.path.exists(work_dir):
            os.mkdir(work_dir)
            assert os.path.exists(work_dir), 'Failed to create work_dir ' + work_dir
        assert os.path.exists(os.path.join(work_dir, 'atrain_features.dat')), f'Failed to find fastdup output {os.path.join(work_dir, "atrain_features.dat")}'
        assert sample_size <= 5000, 'Tensorboard projector is limited to 5000 images'

        imglist, features = load_binary_feature(os.path.join(work_dir, 'atrain_features.dat'), d=d)
        if file_list is not None:
            assert isinstance(file_list, list), 'file_list should be a list of absolute file names given in the same order'
            assert len(file_list) == len(imglist), "file_list should be the same length as imglist got " + str(len(file_list)) + " and " + str(len(imglist))
        export_to_tensorboard_projector_inner(imglist, features, log_dir, sample_size, sample_method, with_images, get_label_func, d=d)
        fastdup_performance_capture("export_to_tensorboard_projector", start_time)

    except Exception as ex:
        fastdup_capture_exception("export_to_tensorboard_projector", ex)

fastdup.export_to_cvat

Function to export a collection of files that needs to be annotated again to cvat batch job format.
This creates a file named fastdup_label.zip in the directory save_path.
The files can be retagged in cvat using Tasks -> Add (plus button) -> Create from backup -> choose the location of the fastdup_label.zip file.

Parameters:

Name Type Description Default
files list

list of files to export

required
labels list

list of labels, or None

required
save_path str

directory in which fastdup_label.zip is created

required

Returns:

Name Type Description
ret int

0 in case of success, otherwise 1.

Source code in fastdup/__init__.py
def export_to_cvat(files, labels, save_path):
    """
    Function to export a collection of files that needs to be annotated again to cvat batch job format.
    This creates a file named fastdup_label.zip in the directory save_path.
    The files can be retagged in cvat using Tasks -> Add (plus button) -> Create from backup -> choose the location of the fastdup_label.zip file.

    Args:
        files (str):
        labels (str):
        save_path (str):

    Returns:
        ret (int): 0 in case of success, otherwise 1.
    """
    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        assert len(files), "Please provide a list of files"
        assert labels is None or isinstance(labels, list), "Please provide a list of labels"

        from fastdup.cvat import do_export_to_cvat
        ret =  do_export_to_cvat(files, labels, save_path)
        fastdup_performance_capture("export_to_cvat", start_time)
        return ret
    except Exception as e:
        fastdup_capture_exception("export_to_cvat", e)

fastdup.export_to_labelImg

Function to export a collection of files that needs to be annotated again to cvat batch job format.
This creates a file named fastdup_label.zip in the directory save_path.
The files can be retagged in cvat using Tasks -> Add (plus button) -> Create from backup -> choose the location of the fastdup_label.zip file.

Parameters:

Name Type Description Default
files list

list of files to export

required
labels list

list of labels, or None

required
save_path str

output directory

required

Returns:

Name Type Description
ret int

0 in case of success, otherwise 1.

Source code in fastdup/__init__.py
def export_to_labelImg(files, labels, save_path):
    """
    Function to export a collection of files that needs to be annotated again to cvat batch job format.
    This creates a file named fastdup_label.zip in the directory save_path.
    The files can be retagged in cvat using Tasks -> Add (plus button) -> Create from backup -> choose the location of the fastdup_label.zip file.

    Args:
        files (str):
        labels (str):
        save_path (str):

    Returns:
        ret (int): 0 in case of success, otherwise 1.
    """
    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        assert len(files), "Please provide a list of files"
        assert labels is None or isinstance(labels, list), "Please provide a list of labels"

        from fastdup.label_img import do_export_to_labelimg
        ret =  do_export_to_labelimg(files, labels, save_path)
        fastdup_performance_capture("export_to_labelImg", start_time)
        return ret
    except Exception as e:
        fastdup_capture_exception("export_to_labelImg", e)
        return 1

Fastdup utilities to remove images

fastdup.find_top_components

Function to find the largest components of duplicate images

Parameters:

Name Type Description Default
work_dir str

working directory where fastdup.run was run.

required
get_label_func callable

Optional function that, given an absolute path to an image, returns the image label.
The image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
Alternatively, get_label_func can be the name of a file containing a string label for each file. The first row should be index,label. The label file should have the same length and the same order as the atrain_features_data.csv image list file.

None
group_by str

optional parameter to group by 'visual' or 'label'. When grouping by visual fastdup aggregates visually similar images together.
When grouping by 'label' fastdup aggregates images with the same label together.

'visual'
slice str

optional parameter to slice the results by a specific label. For example, if you want to slice by 'car' then pass 'car' as the slice parameter.

None
threshold float

optional threshold to select only distances larger than the threshold

None
metric str

optional metric to sort by. Valid values are mean,min,max,unique,blur,size

None
descending bool

optional value to sort the components, default is True

True
min_items int

optional value, select only components with at least min_items

None
max_items int

optional value, select only components with at most max_items

None
keyword str

optional, select labels with keyword value inside

None
save_path str

optional, save path

None
comp_type str

optional, either component or cluster

'component'

Returns:

Name Type Description
df pd.DataFrame

DataFrame of top components. The column component_id includes the component name.
The column files includes a list of all image files in this component.
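
The effect of min_items, max_items and descending can be sketched on a frame with the two documented columns, component_id and files. The filtering and sorting below is an assumption about the intent of those parameters, not fastdup's code; filter_components and the n_files column are names made up for this example.

```python
import pandas as pd

# Hypothetical result frame with the two documented columns.
components = pd.DataFrame({
    "component_id": [7, 12, 31],
    "files": [
        ["a.jpg", "b.jpg"],
        ["c.jpg", "d.jpg", "e.jpg", "f.jpg"],
        ["g.jpg", "h.jpg", "i.jpg"],
    ],
})


def filter_components(df, min_items=None, max_items=None, descending=True):
    """Keep components within the size bounds and sort them by size."""
    out = df.assign(n_files=df["files"].apply(len))
    if min_items is not None:
        out = out[out["n_files"] >= min_items]
    if max_items is not None:
        out = out[out["n_files"] <= max_items]
    return out.sort_values("n_files", ascending=not descending).reset_index(drop=True)


top = filter_components(components, min_items=3)
print(top[["component_id", "n_files"]])
```

Here the two-file component is dropped and the remaining components are ordered from largest to smallest.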

Source code in fastdup/__init__.py
def find_top_components(work_dir, get_label_func=None, group_by='visual', slice=None, threshold=None, metric=None,
                        descending=True, min_items=None, max_items = None, keyword=None,  save_path=None,
                        comp_type="component", **kwargs):
    '''
    Function to find the largest components of duplicate images

    Args:
        work_dir (str): working directory where fastdup.run was run.

        get_label_func (callable): optional function given an absolute path to an image return the image label.
            Image label can be a string or a list of strings. Alternatively, get_label_func can be a dictionary where the key is the absolute file name and the value is the label or list of labels.
            Alternatively, get_label_func can be a filename containing string label for each file. First row should be index,label. Label file should be same length and same order of the atrain_features_data.csv image list file.

        group_by (str): optional parameter to group by 'visual' or 'label'. When grouping by visual fastdup aggregates visually similar images together.
            When grouping by 'label' fastdup aggregates images with the same label together.

        slice (str): optional parameter to slice the results by a specific label. For example, if you want to slice by 'car' then pass 'car' as the slice parameter.

        threshold (float): optional threshold to select only distances larger than the threshold

        metric (str): optional metric to sort by. Valid values are mean,min,max,unique,blur,size

        descending (bool): optional value to sort the components, default is True

        min_items (int): optional value, select only components with at least min_items

        max_items (int): optional value, select only components with at most max_items

        keyword (str): optional, select labels with keyword  value inside

        save_path (str): optional, save path

        comp_type (str): optional, either component or cluster

    Returns:
        df (pd.DataFrame): of top components. The column component_id includes the component name.
            The column files includes a list of all image files in this component.


    '''
    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        from .galleries import do_find_top_components
        ret = do_find_top_components(work_dir, get_label_func, group_by, slice, threshold=threshold,
                                      metric=metric, descending=descending, min_items=min_items, max_items = max_items,
                                      keyword=keyword, save_path=save_path, comp_type=comp_type, kwargs=kwargs)
        fastdup_performance_capture("find_top_components", start_time)
        return ret
    except Exception as ex:
        fastdup_capture_exception("find_top_components", ex)

fastdup.delete_components

Function to automate deletion of duplicate images using the connected components analysis.

Example

import fastdup
fastdup.run('/path/to/data', '/path/to/output')
top_components = fastdup.find_top_components('/path/to/output')
fastdup.delete_components(top_components, None, how = 'one', dry_run = False)

Parameters:

Name Type Description Default
top_components pd.DataFrame

largest components as found by the function find_top_components().

required
to_delete list

a list of integer component ids to delete. The default is None, which means duplicates are deleted from all components.

None
how str

either 'all' (deletes the whole component) or 'one' (leaves one image and deletes the rest of the duplicates)

'one'
dry_run bool

if True, does not delete but prints the rm commands that would be used; otherwise deletes

True

Returns:

Name Type Description
ret list

list of deleted files
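
The difference between how='one' and how='all' comes down to which files of a component are selected. The sketch below mirrors the selection in the source listing (files[1:] for 'one', components with a single file skipped) without touching the file system; files_to_delete is a hypothetical helper name.

```python
def files_to_delete(files, how="one"):
    """Return the files that would be removed from one component."""
    if how not in ("one", "all"):
        raise ValueError("how should be one of 'one'|'all'")
    if len(files) <= 1:
        # A single-file component has no duplicates to remove.
        return []
    # 'one' keeps the first file and removes the rest; 'all' removes everything.
    return list(files[1:]) if how == "one" else list(files)


component = ["/data/a.jpg", "/data/a_copy.jpg", "/data/a_copy2.jpg"]  # hypothetical
keep_one = files_to_delete(component, how="one")
remove_all = files_to_delete(component, how="all")
print(keep_one)
print(remove_all)
```

Checking the selection this way, or running with dry_run=True, is a cheap safeguard before any file is actually removed.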

Source code in fastdup/__init__.py
def delete_components(top_components, to_delete = None,  how = 'one', dry_run = True):
    '''
    function to automate deletion of duplicate images using the connected components analysis.

        Example:
        >>> import fastdup
        >>> fastdup.run('/path/to/data', '/path/to/output')
        >>> top_components = fastdup.find_top_components('/path/to/output')
        >>> fastdup.delete_components(top_components, None, how = 'one', dry_run = False)

    Args:
        top_components (pd.DataFrame): largest components as found by the function find_top_components().
        to_delete (list): a list of integer component ids to delete. On default None which means delete duplicates from all components.
        how (str): either 'all' (deletes the whole component) or 'one' (leaves one image and deletes the rest of the duplicates)
        dry_run (bool): if True does not delete but print the rm commands used, otherwise deletes

    Returns:
        ret (list): list of deleted files

    '''

    try:
        start_time = time.time()
        assert isinstance(top_components, pd.DataFrame), "top_components should be a pandas dataframe"
        assert len(top_components), "top_components should not be empty"
        assert to_delete is None or isinstance(to_delete, list), "to_delete should be a list of integer component ids"
        if isinstance(to_delete, list):
            assert len(to_delete), "to_delete should not be empty"
            assert isinstance(to_delete[0], int) or isinstance(to_delete[0], np.int64), "to_delete should be a list of integer component ids"
        assert how == 'one' or how == 'all', "how should be one of 'one'|'all'"
        assert isinstance(dry_run, bool)

        if to_delete is None:
            to_delete = top_components['component_id'].tolist()

        total_deleted = []

        for comp in (to_delete):
            subdf = top_components[top_components['component_id'] == comp]
            if (len(subdf) == 0):
                print("Warning: failed to find image files for component id", comp)
                continue

            files = subdf['files'].values[0]
            if (len(files) == 1):
                print('Warning: component id ', comp, ' has no related images, please check..')
                continue

            if (how == 'one'):
                files = files[1:]

            inner_delete(files, how='delete', dry_run=dry_run)
            total_deleted += files

        fastdup_performance_capture("delete_components", start_time)
        return total_deleted
    except Exception as ex:
        fastdup_capture_exception("delete_components", ex)

fastdup.delete_or_retag_stats_outliers

function to automate deletion of outlier files based on computed statistics.

Example

import fastdup
fastdup.run("/my/data/", work_dir="out")
# delete the 5% of images with the lowest mean value (the darkest images)
fastdup.delete_or_retag_stats_outliers("out", metric="mean", lower_percentile=0.05, dry_run=False)

It is recommended to run with dry_run=True first, to see the list of files deleted before actually deleting.

Example

This example first finds wrong labels using the similarity gallery and then deletes anything with score < 51.
Score is in range 0-100 where 100 means this image is similar only to images from the same class label.
Score 0 means this image is only similar to images from other class labels.

import fastdup
df2 = fastdup.create_similarity_gallery(..., get_label_func=...)
fastdup.delete_or_retag_stats_outliers(df2, metric='score', filename_col = 'from', lower_threshold=51, dry_run=True)

Note: it is possible to run with both lower_percentile and upper_percentile at once. It is not possible to run with lower_percentile and lower_threshold at once since they may be conflicting.

Parameters:

Name Type Description Default
stats_file str
  • folder pointing to fastdup workdir or
  • file pointing to work_dir/atrain_stats.csv file or
  • pandas DataFrame containing the list of files given in the filename_col column and a metric column.
required
metric str

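Statistic metric, should be one of "blur", "mean", "min", "max", "stdv", "unique", "width", "height", "size". "size" is computed as width * height. "score" is also accepted when stats_file is the DataFrame returned by create_similarity_gallery().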

required
filename_col str

column name in the stats_file to use as the filename

'filename'
lower_percentile float

Lower percentile to use for the threshold. Values range from 0 to 1, where 0.05 means remove the 5% of files with the lowest values.

None
upper_percentile float

Upper percentile to use for the threshold. Values range from 0 to 1, where 0.95 means remove the 5% of files with the highest values.

None
lower_threshold float

Absolute lower threshold. Files whose metric value is below it are selected. Cannot be combined with lower_percentile.

None
upper_threshold float

Absolute upper threshold. Files whose metric value is above it are selected. Cannot be combined with upper_percentile.

None
get_reformat_filename_func callable

Optional function that maps each filename to another string. Useful when fastdup was run on a different folder or machine and you would like to delete files in another folder.

None
dry_run bool

If True, does not delete anything and only prints the rm commands that would be used; otherwise deletes the files.

True
how str

One of 'delete', 'move' or 'retag'. For retag, the allowed values are retag=labelImg or retag=cvat.

'delete'
save_path str

Optional. When how == 'retag', the label files are moved to this folder. Required when how == 'move', as the folder to move the files to.

None
work_dir str

Path to the fastdup work_dir. Required (and must exist) when stats_file is a pandas DataFrame; optional otherwise.

None

Returns:
ret (list): list of deleted files (or moved or retagged files)
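
For get_reformat_filename_func, a minimal sketch of a callable that rewrites a path prefix is shown here. The folder names and the helper remap_to_local are hypothetical; only the parameter names in the commented call come from the signature documented above.

```python
# Hypothetical folders: where fastdup originally saw the images, and
# where the copies to delete actually live on this machine.
OLD_ROOT = "/mnt/train_machine/images"
NEW_ROOT = "/data/local_copy/images"


def remap_to_local(filename):
    """Swap the OLD_ROOT prefix for NEW_ROOT; leave other paths unchanged."""
    if filename == OLD_ROOT or filename.startswith(OLD_ROOT + "/"):
        return NEW_ROOT + filename[len(OLD_ROOT):]
    return filename


print(remap_to_local("/mnt/train_machine/images/cats/001.jpg"))
# /data/local_copy/images/cats/001.jpg

# Passed to fastdup, the callable is applied to every selected filename
# (left commented out so the sketch runs without fastdup installed):
# fastdup.delete_or_retag_stats_outliers("out", metric="blur", lower_percentile=0.05,
#     get_reformat_filename_func=remap_to_local, dry_run=True)
```

Keeping dry_run=True on the first pass makes it easy to check that the remapped paths point where expected before anything is removed.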

Source code in fastdup/__init__.py
def delete_or_retag_stats_outliers(stats_file, metric, filename_col = 'filename', label_col=None, lower_percentile=None, upper_percentile=None,
                          lower_threshold=None, upper_threshold=None, get_reformat_filename_func=None, dry_run=True,
                          how='delete', save_path=None, work_dir=None):
    '''
    function to automate deletion of outlier files based on computed statistics.

    Example:
        >>> import fastdup
        >>> fastdup.run('/my/data/', work_dir="out")
        >>> # delete the 5% of images with the lowest mean value (the darkest images)
        >>> fastdup.delete_or_retag_stats_outliers("out", metric="mean", lower_percentile=0.05, dry_run=False)

        It is recommended to run with dry_run=True first, to see the list of files deleted before actually deleting.

    Example:
        This example first finds wrong labels using the similarity gallery and then deletes anything with score < 51.
        Score is in range 0-100 where 100 means this image is similar only to images from the same class label.
        Score 0 means this image is only similar to images from other class labels.
        >>> import fastdup
        >>> df2 = fastdup.create_similarity_gallery(..., get_label_func=...)
        >>> fastdup.delete_or_retag_stats_outliers(df2, metric='score', filename_col='from', lower_threshold=51, dry_run=True)

    Note: it is possible to run with both `lower_percentile` and `upper_percentile` at once. It is not possible to run with `lower_percentile` and `lower_threshold` at once since they may be conflicting.

    Args:
        stats_file (str):
          * folder pointing to fastdup workdir or
          * file pointing to work_dir/atrain_stats.csv file or
          * pandas DataFrame containing the list of files given in the filename_col column and a metric column.

        metric (str): statistic metric, should be one of "blur", "mean", "min", "max", "stdv", "unique", "width", "height", "size"

        filename_col (str): column name in the stats_file to use as the filename

        lower_percentile (float): lower percentile to use for the threshold. Values are 0->1, where 0.05 means remove 5% of the lowest values.

        upper_percentile (float): upper percentile to use for the threshold. Values are 0->1, where 0.95 means remove 5% of the upper values.

        lower_threshold (float): lower threshold to use for the threshold. Only used if lower_percentile is None.

        upper_threshold (float): upper threshold to use for the threshold. Only used if upper_percentile is None.

        get_reformat_filename_func (callable): Optional parameter to allow changing the  filename into another string. Useful in the case fastdup was run on a different folder or machine and you would like to delete files in another folder.

        dry_run (bool): if True does not delete but print the rm commands used, otherwise deletes

        how (str): either 'delete' or 'move' or 'retag'. In case of retag allowed value is retag=labelImg or retag=cvat

        save_path (str): optional. In case of a folder and how == 'retag' the label files will be moved to this folder.

        work_dir (str): optional. In case of stats dataframe, point to fastdup work_dir.


      Returns:
          ret (list): list of deleted files (or moved or retagged files)

    '''
    try:
        start_time = time.time()
        fastdup_capture_log_debug_state(locals())

        assert isinstance(dry_run, bool)
        assert how == 'delete' or how == 'move' or how == 'retag', "how should be one of 'delete'|'move'|'retag'"
        if how == 'move':
            assert save_path is not None, "When how='move' need to provide save_path to move the files to"

        if lower_threshold is not None and lower_percentile is not None:
            assert False, 'You should only specify one of lower_threshold or lower_percentile'

        if upper_threshold is not None and upper_percentile is not None:
            assert False,  'You should only specify one of upper_threshold or upper_percentile'

        if isinstance(stats_file, pd.DataFrame):
            assert isinstance(work_dir, str) and os.path.exists(work_dir), "When providing pandas dataframe need to set work_dir to point to fastdup work_dir"
            df = stats_file
        else:
            df = load_stats(stats_file, work_dir, {})
        if metric == "score" and metric not in df.columns:
            assert False, "For removing wrong labels created by the create_similarity_gallery() need to run stats_file=df where df is the output of create_similarity_gallery()"


        assert metric in df.columns or metric=='size', f"Unknown metric {metric} options are {df.columns}"
        assert filename_col in df.columns
        if label_col:
            assert label_col in df.columns, f"{label_col} column should be in the stats_file"

        if metric == 'size':
            df['size'] = df.apply(lambda x: x['width'] * x['height'], axis=1)

        if lower_percentile is not None:
            assert lower_percentile >= 0 and lower_percentile <= 1, "lower_percentile should be between 0 and 1"
            lower_threshold = df[metric].quantile(lower_percentile)
        if upper_percentile is not None:
            assert upper_percentile >= 0 and upper_percentile <= 1, "upper_percentile should be between 0 and 1"
            upper_threshold = df[metric].quantile(upper_percentile)

        orig_df = df.copy()
        orig_len = len(df)

        if (lower_threshold is not None):
            print(f"Going to delete any images with {metric} < {lower_threshold}")
            df = orig_df[orig_df[metric] < lower_threshold]
            if (upper_threshold is not None):
                print(f"Going to delete any images with {metric} > {upper_threshold}")
                df = pd.concat([df, orig_df[orig_df[metric] > upper_threshold]], axis=0)
        elif (upper_threshold is not None):
                print(f"Going to delete any images with {metric} > {upper_threshold}")
                df = orig_df[orig_df[metric] > upper_threshold]
        else:
            assert(False), "You should specify either lower_threshold or upper_threshold or lower_percentile or upper_percentile"


        if orig_len == len(df):
            print('Warning: current request to delete all files, please select a subset of files to delete.', orig_len, len(df))
            print(df[metric].describe(), lower_threshold, upper_threshold)
            return 0
        elif len(df) == 0:
            print('Did not find any items to delete, please check your selection')
            return 0


        if get_reformat_filename_func is None:
            files = df[filename_col].values
        else:
            files = df[filename_col].apply(get_reformat_filename_func).values

        if how == 'delete' or how == 'move':
            return inner_delete(files, how=how, dry_run=dry_run, save_path=save_path)
        elif how.startswith('retag'):
            if label_col is not None:
                label = df[label_col].values
            else:
                label = None
            return inner_retag(files, label, how, save_path)
        else:
            assert(False), "How should be one of 'delete'|'move'|'retag'"

        fastdup_performance_capture("delete_or_retag_stats_outliers", start_time)
        return files
    except Exception as e:
        fastdup_capture_exception("delete_or_retag_stats_outliers", e)