These scripts perform the various tasks described in the rfCon manuscript. For more detailed instructions on running each script, read the comments included in each file.

avgVarImportance.py = a script for finding the average variable importance of each category of features
chooseRandomFold.py = for generating 50%off data. Chooses a random 50% of the input file's negative and positive classes.
chooseRandomPercent.py = for generating 50%off data. 
chooseRandomPercent_v2.0.py = same as above but maintains imbalance ratios by allowing for independent pos/neg resampling.
combineBlindSdaPreds.py = blind test SDA prediction target data generation script for combining raw pred files.
combineSdaPreds.py = SDA prediction target data generation script for combining raw pred training files
convertSdaScoresToFeatures.py = create feature files for the SDA ensemble by combining the output of individual SDA models
equalize_v2.py = class balancer for the examples in a target file
featureCount.py = count features present
featureRemoval_meanDecreaseAccuracy_sep12.py = performs sep12 feature selection
featureRemoval_meanDecreaseAccuracy_sep24.py = performs sep24 feature selection
featureRemoval_meanDecreaseAccuracy_sep6.py = performs sep6 feature selection
findDupProt.py = find duplicate proteins based on the id tag on example lines
genRegression.py = convert examples to regression examples based on the formula in the rfcon manuscript
get_only_X.py = sda data input conversion
labelSdaEnsemblePreds.py = label the features for sda ensemble predictions (adds comments to the examples)
matchBalancedExamps.py = an efficient script for matching example lines between two different files. Used to reproduce training fold files. Capable of scanning the 1TB of total feature files for over 1 million matches in under 30 minutes.
matchBalancedExamps_v0.py = the same as above but with less flexible input options.
meanDecreaseGini_featureRemovalTest_sep6.py = sep6 feature selection with mean decrease gini
multiClassifySVMplus.py = classify many svm targets at once in parallel
multiRF_predict.py = classify many RF targets at once.
multiSdaPredict.py = classify many Sda targets at once in parallel
multiTopPred.py = converts multiple output prediction files to CASP competition format for easier understanding. 
pickAndLabelRf.py = processes RF output predictions
pickle_edit.py = organize input data folds for sda training. Creates a python pickle (X only) for input data conversion.
pickSinglePred.py = extract only the highest vote (pos/neg) for each sda prediction
pickTopPreds_reg.py = converts output prediction files to CASP competition format for easier understanding. For use only on the sda ensemble final models, which use regression-based examples (not binary classification).
posneg.py = determine positive and negative ratio of an example or prediction file
removeDuplicates.py = compare two example files for examples derived from matching proteins and remove those cases. Generates a third file which is free of these duplicates.
rrEval.py = perform evaluation of a single predictor on a single target
runMultiRrEval.py = multiple target performance evaluation of a predictor
sda_pick_positive.py = sda prediction value processing and extraction of predictions counted as positive.
svmFeaturesToRfFeatures.py = convert features to the random forest input format
targetValFix.py = convert features from svm to sda format (changes target values of -1 to 0)
targetVal_svmConvert.py = convert features from sda to svm format (changes target values of 0 to -1) 
topToArray.py = a script for processing feature selection results to create an array of the top 100 features for easier input into feature selection scripts.
PDB_IDs.txt = a list of all PDB files used in our datasets. The corresponding data for each one can be downloaded from the protein data bank website at https://www.rcsb.org/
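
As a rough illustration of the target value conversion described for targetValFix.py and targetVal_svmConvert.py, the sketch below rewrites the target on a single example line. It is not the code from those scripts. It assumes each example line begins with its target value followed by whitespace-separated features (as in the common SVM-light layout), and that lines beginning with # are comments; the function names are hypothetical.

```python
def convert_target(line, old, new):
    """Return `line` with a leading target value `old` replaced by `new`.

    Assumption: the target is the first whitespace-separated token.
    Blank lines, comment lines, and other targets are returned unchanged.
    """
    stripped = line.lstrip()
    if not stripped or stripped.startswith("#"):
        return line
    target = stripped.split(None, 1)[0]
    if target != old:
        return line
    # Keep everything after the target (features, comments, newline) as-is.
    return new + stripped[len(target):]


def svm_to_sda(line):
    """svm -> sda direction: target -1 becomes 0 (hypothetical helper)."""
    return convert_target(line, "-1", "0")


def sda_to_svm(line):
    """sda -> svm direction: target 0 becomes -1 (hypothetical helper)."""
    return convert_target(line, "0", "-1")
```

Applied line by line to a feature file, svm_to_sda leaves positive examples untouched and only changes the negative targets; sda_to_svm reverses the change.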