[{"content":" Update (2026-03-24): Since the original post, I added three things. First, a Reactome pathway tool that fetches biological pathways for each gene target. Second, a Neo4j graph database for accumulating results across pipeline runs. Reports are ingested as a knowledge graph (diseases, genes, proteins, compounds, papers, pathways) that enables cross-disease queries like \u0026ldquo;which targets overlap between Alzheimer\u0026rsquo;s and Parkinson\u0026rsquo;s?\u0026rdquo; Third, a web frontend at bio.arcosdiaz.com for interactive exploration of the graph (still work in progress). All three are described below.\nI built a multi-agent system in plain python that takes a disease name and autonomously finds potential drug targets by querying public bioinformatics databases. You enter \u0026ldquo;Alzheimer disease\u0026rdquo; as an input and returns a ranked list of targets, each annotated with protein structure data, known compounds, clinical trial progress, and recent literature. I ran it on three diseases and the results matched real-world pharma consensus in every case, without any hardcoded domain knowledge.\nFor Alzheimer\u0026rsquo;s, the system identified APP as the top target (55 known compounds, Phase 3 trials) and flagged APOE as the strongest genetic risk factor but undruggable: zero compounds. For Parkinson\u0026rsquo;s, LRRK2 came out first, which is the kinase that Denali and Biogen are both currently targeting using clinical-stage inhibitors. For schizophrenia, DRD2, because every antipsychotic on the market targets the dopamine D2 receptor.\nThe problem Early-stage drug target identification is basically a cross-referencing exercise. Open Targets tells you which genes are associated with a disease. UniProt tells you what the protein looks like: its family, subcellular location, whether it has a solved 3D structure. ChEMBL tells you if compounds exist that bind it, and how far they have progressed in trials. 
PubMed tells you what the literature says about the gene-disease association.\nA scientist doing this manually queries each database, copies results into a spreadsheet, and writes a recommendation. I wanted to know if an LLM-coordinated agent system could do this end-to-end and produce results that are biologically valid.\nArchitecture The system uses three specialized agents and one orchestrator.\nThe Gene Hunter queries Open Targets\u0026rsquo; GraphQL API for disease-gene associations and returns the top-ranked genes. Then three tasks run in parallel via asyncio.gather() for each gene: the Druggability Assessor hits UniProt for protein annotations and ChEMBL for compound/bioactivity data, the Literature Validator searches PubMed and gets recent abstracts, and a Reactome pathway lookup fetches the biological pathways the gene is involved in. Each agent uses Gemini 2.5 Flash (free tier, but you can use more advanced models) for its reasoning step, turning raw protein data into a druggability verdict or classifying literature evidence as supporting, contradicting, or inconclusive.\nThe orchestrator then compiles everything into a ranked report plus recommendation.\nI separated the agents because the required reasoning is different. The Druggability Assessor interprets protein families and compound binding data (pharmacology). The Literature Validator reads abstracts and weighs conflicting evidence (biomedical text analysis). The Reactome lookup is a pure API call with no LLM reasoning, but its pathway data feeds into the final synthesis so the LLM can reason about shared biological mechanisms across targets. Putting all of this in one prompt would make each task\u0026rsquo;s instructions less specific. The architecture is also modular: adding a Clinical Trials agent would not require touching existing code.\nI deliberately avoided LangGraph and similar agent frameworks. The orchestration logic is just async Python: a few gather() calls and some loops. 
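The per-gene fan-out described above can be sketched in a few lines of plain asyncio. This is a minimal illustration, not the repo's actual code: the coroutine names (assess_druggability, validate_literature, fetch_pathways) and the returned dict shapes are assumptions standing in for the real API-backed agents.

```python
import asyncio

# Illustrative stand-ins for the three per-gene tasks; the real versions
# would query UniProt/ChEMBL, PubMed, and Reactome respectively.
async def assess_druggability(gene: str) -> dict:
    return {"gene": gene, "verdict": "druggable"}

async def validate_literature(gene: str) -> dict:
    return {"gene": gene, "evidence": "supporting"}

async def fetch_pathways(gene: str) -> dict:
    return {"gene": gene, "pathways": ["amyloid processing"]}

async def profile_gene(gene: str) -> dict:
    # The three independent lookups for one gene run concurrently.
    drug, lit, path = await asyncio.gather(
        assess_druggability(gene),
        validate_literature(gene),
        fetch_pathways(gene),
    )
    return {"gene": gene, "druggability": drug, "literature": lit, "pathways": path}

async def run_pipeline(genes: list[str]) -> list[dict]:
    # ...and all genes are profiled concurrently as well; gather preserves order.
    return await asyncio.gather(*(profile_gene(g) for g in genes))

reports = asyncio.run(run_pipeline(["APP", "APOE", "PSEN1"]))
print([r["gene"] for r in reports])  # → ['APP', 'APOE', 'PSEN1']
```

In the real pipeline each stand-in would wrap an HTTP call plus an LLM reasoning step, but the control flow is exactly this: nested gather() calls, no framework.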
Pydantic models define the data contracts between agents. I might extend it to add functionality that really requires such a framework.\nWhat it found Alzheimer disease The system ranked five targets. APP (amyloid beta precursor protein) came out on top: 55 known compounds, Phase 3 trials, direct causal role in early-onset AD through the amyloid pathway. APP has been the central therapeutic hypothesis in Alzheimer\u0026rsquo;s for decades, so this is what you would want to see.\nAPOE is the more interesting result. The system flagged it as the strongest genetic risk factor for late-onset AD but noted zero compounds and zero clinical progress. APOE is a lipid transport protein, and the field has been trying to figure out how to drug it with small molecules for years without success. The system identified that gap from the data alone.\nPSEN1 and PSEN2 were identified as gamma-secretase components (peptidase A22A family) with no clinical-phase compounds, and the system flagged toxicity concerns. This maps onto real history: gamma-secretase inhibitors like semagacestat and avagacestat failed in trials because of toxicity from Notch signaling disruption. The database queries can\u0026rsquo;t directly surface those specific failures, but the protein annotations were enough for the LLM to flag the risk.\nParkinson disease LRRK2 ranked first: a protein kinase with 54 known compounds and Phase 4 annotation. It is the most actively pursued kinase target in Parkinson\u0026rsquo;s right now, with inhibitors from Denali and Biogen in clinical trials.\nSNCA (alpha-synuclein) was noted as \u0026ldquo;intrinsically disordered,\u0026rdquo; meaning it does not have a stable, well-defined 3D structure under physiological conditions and is challenging for conventional small molecule drug design, even though it is central to PD pathology. 
The system understood that distinction, which is the kind of thing I was hoping it could do: not just retrieve data, but reason about what makes a protein a good target.\nPRKN (Parkin), an E3 ubiquitin ligase with zero compounds, was flagged as a candidate for PROTAC-type approaches (therapeutic effect by inducing degradation of the protein rather than inhibiting its function). Its partner kinase PINK1 was identified with activators entering trials. Both are in the mitophagy pathway (removal of damaged mitochondria), and both calls are reasonable.\nSchizophrenia DRD2: Phase 4 (approved drugs), G-protein coupled receptor, 57 compounds, strong genome-wide association study evidence. Every antipsychotic targets it. Not surprising, so the system should find it.\nSHANK3 was flagged as an intracellular scaffold protein with zero compounds. It is hard to target a protein-protein interaction hub sitting inside the postsynaptic density, and the system said as much.\nDRD3 showed 99 compounds at Phase 2. For example, cariprazine: a DRD3-preferring partial agonist approved for schizophrenia.\nUnder the hood The pipeline is fully async. Each gene\u0026rsquo;s druggability and literature assessments run concurrently, with staggered delays for PubMed\u0026rsquo;s rate limits (3 retries, 2-second backoff on 429 responses).\nEach agent has a separate call_llm() function, which makes testing simple: the test files mock all HTTP calls with respx and all LLM calls with AsyncMock. No live API calls required in the test suite.\nSix Pydantic models define the contracts between agents: GeneAssociation, DruggabilityProfile, LiteratureEvidence, Pathway, TargetReport, ReconReport. The orchestrator composes them into a final report serialized as both JSON (for downstream analysis) and Markdown (for reading). 
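The data contracts can be illustrated with plain dataclasses. This is a sketch only: the real project uses Pydantic models under these names, and the fields shown here are a guessed minimal subset, not the actual schema.

```python
import json
from dataclasses import dataclass, field, asdict

# Minimal stand-ins for two of the six contracts; field names are
# illustrative assumptions, not the project's actual schema.
@dataclass
class TargetReport:
    gene: str
    druggability_verdict: str
    evidence: str

@dataclass
class ReconReport:
    disease: str
    targets: list = field(default_factory=list)

    def to_json(self) -> str:
        # Serialized as JSON for downstream analysis (e.g. a graph loader);
        # a Markdown renderer would walk the same structure for reading.
        return json.dumps(asdict(self), indent=2)

report = ReconReport(
    disease="Alzheimer disease",
    targets=[TargetReport("APP", "druggable", "supporting")],
)
print(report.to_json())
```

Pydantic adds validation and parsing on top of this shape, but the idea is the same: every agent returns a typed object, and the orchestrator only ever composes typed objects.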
The JSON output can then be loaded into Neo4j using a dedicated loader script that creates a graph of diseases, genes, proteins, compounds, papers, and pathways with merge-safe uniqueness constraints.\nAll five APIs are free. Open Targets, UniProt, ChEMBL, and Reactome need no authentication. PubMed just wants an email address.\nLimitations The system queries only Open Targets for gene-disease associations. GWAS Catalog or DisGeNET would improve coverage. PubMed returns abstracts, not full text, so the literature agent misses nuance in methods sections and supplementary data. The clinical trial integration is limited to ChEMBL\u0026rsquo;s \u0026ldquo;max phase\u0026rdquo; field information: a ClinicalTrials.gov query agent would add real trial design details.\nThe LLM evidence classifications are sometimes too cautious. Several targets got \u0026ldquo;inconclusive\u0026rdquo; when the literature clearly supports their role. But this is a prompt engineering problem, not an architectural one.\nWhat I\u0026rsquo;d build next The Neo4j graph, Reactome integration, and a first version of the web frontend are now in place (see update above). The frontend is still work in progress. The next addition would be a Clinical Trials agent querying ClinicalTrials.gov for actual trial design and status data.\nGitHub repo\n","permalink":"https://arcosdiaz.com/posts/2026-03-19-drug-target-agent/","summary":"\u003cblockquote\u003e\n\u003cp\u003e\u003cstrong\u003eUpdate (2026-03-24):\u003c/strong\u003e Since the original post, I added three things. First, a \u003cstrong\u003eReactome pathway tool\u003c/strong\u003e that fetches biological pathways for each gene target. Second, a \u003cstrong\u003eNeo4j graph database\u003c/strong\u003e for accumulating results across pipeline runs. 
Reports are ingested as a knowledge graph (diseases, genes, proteins, compounds, papers, pathways) that enables cross-disease queries like \u0026ldquo;which targets overlap between Alzheimer\u0026rsquo;s and Parkinson\u0026rsquo;s?\u0026rdquo; Third, a \u003cstrong\u003eweb frontend\u003c/strong\u003e at \u003ca href=\"https://bio.arcosdiaz.com\"\u003ebio.arcosdiaz.com\u003c/a\u003e for interactive exploration of the graph (still work in progress). All three are described below.\u003c/p\u003e","title":"Building a multi-agent system for drug target discovery"},{"content":"tl;dr I trained three classifiers (Logistic Regression, Random Forest, XGBoost) to predict brain region of origin from GTEx bulk RNA-seq expression profiles across 13 brain regions and 2,642 samples.\nXGBoost did best: 95.1% accuracy (5-fold CV: 94.9 +/- 0.9%), macro-averaged AUROC near 0.99. Cerebellum and spinal cord were classified perfectly (F1 = 1.00). Basal ganglia subregions (caudate, putamen, nucleus accumbens) were hardest to separate (F1 ~ 0.89-0.96), which makes sense given their shared developmental origin. The top discriminative genes are not statistical artefacts. They map onto known neurobiology: RORB (#2, cortical layer IV marker), GAL and TRH (#9 and #19, hypothalamic neuropeptides), and a cluster of cerebellar-specific genes (ARHGEF33, HR, KCNJ6) all appear near the top. Non-coding RNAs (lncRNAs + pseudogenes) make up ~37% of the top 30 features. The brain has the highest proportion of non-coding transcription of any organ, so this isn\u0026rsquo;t surprising. Disclaimer: This was a hobby project. I tried to be rigorous, but these results are an initial exploration, not an exhaustive analysis. The pseudogene hits at the top of the ranking especially need validation to rule out mapping artefacts.\nIntroduction The Genotype-Tissue Expression (GTEx) project provides bulk RNA-seq data across dozens of human tissues, including 13 brain regions. 
The question I wanted to answer: given a gene expression profile from an unknown brain sample, can we predict which region it came from?\nThe classification accuracy itself matters less to me than what the model learns. If it separates the hypothalamus using neuropeptide genes that neuroendocrinologists have studied for decades, that\u0026rsquo;s reassuring. If it relies on mapping artefacts or batch effects, that\u0026rsquo;s a problem.\nData Expression data and sample metadata were downloaded from GTEx via the recount3 R/Bioconductor package. After filtering for brain samples and removing lowly-expressed genes (expressed in \u0026lt; 20% of samples), the final dataset was:\nSamples: 2,642\nGenes: 18,731\nBrain regions: 13\nSamples per region: 139 (substantia nigra) to 255 (cortex)\nExpression values are log2(TPM + 1).\nExploratory data analysis PCA PCA on the standardized expression matrix shows that PC1 alone captures 48% of the variance, which is unusually high. It\u0026rsquo;s mostly separating the cerebellum from everything else. The top 50 PCs explain 90.2% of total variance.\nIn the PC1 vs PC2 scatter, the cerebellum (and cerebellar hemisphere) forms a tight, well-separated cluster. The remaining regions overlap more but still show structure: cortical regions cluster together, basal ganglia regions overlap, hypothalamus and spinal cord sit at the edges.\nUMAP UMAP (fitted on the top 30 PCs, n_neighbors=30) resolves the structure better. Most regions form distinct clusters, with the expected exceptions: cortex and frontal cortex overlap a lot, and the three basal ganglia regions bleed into each other.\nRegion similarity A correlation heatmap and hierarchical clustering of mean expression profiles line up with what the dimensionality reduction shows: brain regions cluster according to known neuroanatomy.\nThe cerebellum branches off first (it\u0026rsquo;s the most transcriptionally distinct). Cortical regions cluster together. 
The three basal ganglia structures are nearest neighbors, which tracks with their shared developmental origin from the lateral ganglionic eminence and their overlapping medium spiny neuron populations.\nClassification Three models were trained on an 80/20 stratified split:\nModel | Accuracy | F1 (weighted)\nLogistic Regression | 0.934 | 0.934\nRandom Forest | 0.902 | 0.902\nXGBoost | 0.951 | 0.951\nXGBoost won comfortably. 5-fold stratified cross-validation confirmed it: 94.9 +/- 0.9% accuracy, fold scores from 0.939 to 0.962.\nPer-region performance The confusion matrix and per-class metrics break down as you\u0026rsquo;d expect:\nF1 = 1.00: Cerebellum, Cerebellar Hemisphere, Spinal cord\nF1 \u0026gt; 0.95: Cortex, Hippocampus, Caudate\nF1 ~ 0.89-0.92: Amygdala, Nucleus accumbens, Putamen\nThe basal ganglia confusion is biologically expected. Caudate, putamen, and nucleus accumbens share cell types and transcriptional programs. Worth noting that Logistic Regression came close (93.4% vs 95.1%), which suggests the expression differences between regions are mostly linearly separable already.\nBiological interpretation I annotated the top 100 discriminative genes (by XGBoost split-gain importance) using the Ensembl REST API and checked whether they match known brain region biology. Short answer: yes.\nWhat the classifier actually learned The top 30 genes fall into recognizable groups:\nHypothalamic neuropeptides (GAL, TRH). Both are textbook hypothalamic markers. GAL (galanin, rank #9) is a major inhibitory neuropeptide concentrated in the hypothalamus, involved in feeding and sleep-wake regulation. TRH (rank #19) is synthesised primarily in the paraventricular nucleus and controls the hypothalamic-pituitary-thyroid axis.\nCerebellar markers (ARHGEF33, HR, KCNJ6, CASD1, FIBCD1). The cerebellum has a cytoarchitecture unlike any other brain region: Purkinje cells, granule cells, Bergmann glia. The model classifies it perfectly, and these genes explain why. 
ARHGEF33 is overexpressed 12x in cerebellum vs. other regions. HR (Hairless) is required for Purkinje cell structural maintenance. KCNJ6 (GIRK2) is a K+ channel abundant in cerebellar granule cells. The weaver mouse, which carries a Kcnj6 missense mutation, exhibits massive granule cell loss.\nCortical identity (RORB, PPP3CA). RORB at rank #2 is a good sanity check. It\u0026rsquo;s the standard transcriptomic marker for cortical layer IV, used in the Allen Brain Atlas and Human Cell Atlas. PPP3CA (calcineurin) is enriched in cortex and hippocampus and is involved in synaptic plasticity.\nIon channels (KCNJ6, KCNQ4, KCNS1, KCTD3, CABP7). Different brain regions have different resting potentials, firing patterns, and ionic conductances. Four potassium channel genes and one calcium-binding protein in the top 30 encode that variation.\nNon-coding RNAs (6 of top 30). The brain expresses more lncRNAs than any other organ. Several of these are antisense to known neural genes (NCAM1-AS1, UNC5B-AS1), suggesting they\u0026rsquo;re cis-regulatory elements whose expression mirrors the region-specific regulation of their sense-strand partners.\nExpression patterns A z-scored heatmap of the top 30 genes across regions shows distinct, region-specific expression blocks:\nBox plots for individual marker genes match expectations. GAL and TRH are highest in the hypothalamus, RORB in cortical regions, KCNJ6 in cerebellum and substantia nigra:\nCaveats The pseudogene at rank #1 (CDCA4P1) is the result I\u0026rsquo;m least confident about. It could be genuine regulatory transcription, or it could be a mapping artefact. Feature importance here tracks what XGBoost deems as more important for classification, not differential expression magnitude, so a gene can rank high without being highly differentially expressed if it happens to be informative at decision boundaries.\nGTEx bulk RNA-seq also averages across all cell types in each tissue block. 
Single-cell or spatial transcriptomics would clarify which cell populations actually drive each marker\u0026rsquo;s region specificity. And Random Forest underperformed (90.2%), probably because I didn\u0026rsquo;t tune its hyperparameters. With proper tuning it would likely close the gap.\nConclusion XGBoost gets 95.1% accuracy at predicting brain region from bulk gene expression across 13 GTEx regions. The accuracy is fine, but I care more about the feature ranking. The top genes aren\u0026rsquo;t mysterious: hypothalamic neuropeptides, cerebellar cell-type markers, cortical layer transcription factors, region-specific ion channels. Non-coding RNAs make up ~37% of the top 30, which fits with the brain\u0026rsquo;s unusually complex non-coding transcriptome.\nLogistic Regression reaching 93.4% is maybe the most telling result. Brain regions are already well-separated in expression space. The hard part isn\u0026rsquo;t model complexity. It\u0026rsquo;s the biology at the boundaries, especially among basal ganglia subregions that share developmental origins and cell types.\n","permalink":"https://arcosdiaz.com/posts/2026-03-01-brain-region-classifier/","summary":"\u003ch2 id=\"tldr\"\u003etl;dr\u003c/h2\u003e\n\u003cp\u003eI trained three classifiers (Logistic Regression, Random Forest, XGBoost) to predict brain region of origin from GTEx bulk RNA-seq expression profiles across 13 brain regions and 2,642 samples.\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eXGBoost did best: 95.1% accuracy (5-fold CV: 94.9 +/- 0.9%), macro-averaged AUROC near 0.99.\u003c/li\u003e\n\u003cli\u003eCerebellum and spinal cord were classified perfectly (F1 = 1.00). Basal ganglia subregions (caudate, putamen, nucleus accumbens) were hardest to separate (F1 ~ 0.89-0.96), which makes sense given their shared developmental origin.\u003c/li\u003e\n\u003cli\u003eThe top discriminative genes are not statistical artefacts. 
They map onto known neurobiology: RORB (#2, cortical layer IV marker), GAL and TRH (#9 and #19, hypothalamic neuropeptides), and a cluster of cerebellar-specific genes (ARHGEF33, HR, KCNJ6) all appear near the top.\u003c/li\u003e\n\u003cli\u003eNon-coding RNAs (lncRNAs + pseudogenes) make up ~37% of the top 30 features. \u003ca href=\"https://pmc.ncbi.nlm.nih.gov/articles/PMC4687686/?utm_source=chatgpt.com\"\u003eThe brain has the highest proportion of non-coding transcription of any organ\u003c/a\u003e, so this isn\u0026rsquo;t surprising.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eDisclaimer:\u003c/strong\u003e This was a hobby project. I tried to be rigorous, but these results are an initial exploration, not an exhaustive analysis. The pseudogene hits at the top of the ranking especially need validation to rule out mapping artefacts.\u003c/p\u003e","title":"Classifying brain regions from gene expression RNA-seq data"},{"content":"- Dashboard Heroku App - Twitter bot @corona7tage All data is based on the official APIs by RKI dashboard and DIVI\n","permalink":"https://arcosdiaz.com/archive/2021-01-01-covid19-germany-dashboard/","summary":"\u003ch2 id=\"--dashboard-heroku-app\"\u003e- \u003ca href=\"https://corona7tage.herokuapp.com\"\u003eDashboard Heroku App\u003c/a\u003e\u003c/h2\u003e\n\u003ch2 id=\"--twitter-bot-corona7tage\"\u003e- \u003ca href=\"https://twitter.com/corona7tage\"\u003eTwitter bot @corona7tage\u003c/a\u003e\u003c/h2\u003e\n\u003cp\u003eAll data is based on the official APIs by \u003ca href=\"https://corona.rki.de\"\u003eRKI dashboard\u003c/a\u003e and \u003ca href=\"https://www.intensivregister.de/#/aktuelle-lage/zeitreihen\"\u003eDIVI\u003c/a\u003e\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;\"\u003e\u003ccode class=\"language-python\" 
data-lang=\"python\"\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e","title":"COVID-19 Germany local incidence and ICU occupancy (in German)"},{"content":"tl;dr I trained 4 different types of models to classify bitcoin transactions. For each, two versions of the feature set were used: all features (local + neighborhood-aggregated) and only local features (without neighborhood information).\nThe best model was a Random Forest trained with all features: its performance was impaired when the aggregated features were removed. The best graph-based neural network model was APPNP and its performance was better when only local features were used. APPNP performed better than an MLP with comparable complexity, indicating that the graph structure information gave it an advantage. Finally, the best GCN model required using all features and several strategies to reduce overfitting. The excellent performance of a Random Forest shows that it makes sense to consider simple models when faced with a new task. It also indicates that the individual node features in the Elliptic dataset are already informative enough to make good predictions. It would be interesting to explore how the model performs, when fewer samples and/or features are available for training.\nA shallow GCN with 2 layers might not be a good choice for node classification of a graph as sparse as the bitcoin transaction graph. If a node has few incoming edges, a graph convolution may not have enough neighbors with features to aggregate.\nAn interesting solution is provided by the APPNP model, which combines message passing with the teleportation principle of personalized pagerank. The long-range (20 iterations in the best model) of the predictions propagation through the network is an aspect that deserves further attention in the future.\nThe main performance metrics for comparison were:\nModel Features Dropout Precision Recall F1 score GCN all 0.5 0.8051 0.4958 0.6137 GCN local 0. 
0.6667 0.4617 0.5456 APPNP all 0.2 0.7791 0.6251 0.6936 APPNP local 0. 0.8158 0.6787 0.7409 MLP all 0.2 0.6538 0.6593 0.6565 MLP local 0. 0.7799 0.6740 0.7231 RandomForest all 0.9167 0.7211 0.8072 RandomForest local 0.8749 0.7036 0.7799 Disclaimer: This was a hobby project done mostly nocturnally and on the weekends out of pure fascination for graph theory and neural networks. Although I made every effort to apply scientific rigor, these results constitute an initial exploration and should not be considered an exhaustive analysis. Importantly, due to the random nature of certain parameters (e.g. dropout), multiple repetitions of the experiments using different random seeds and/or different validation splits are necessary for a conclusive judgement.\n#hide_input import os import pandas as pd import seaborn as sns import matplotlib import matplotlib.pyplot as plt %matplotlib inline matplotlib.rcParams[\u0026#39;figure.dpi\u0026#39;] = 300 path = os.path.realpath(\u0026#39;.\u0026#39;) runs_config = pd.read_csv(path+\u0026#39;/experiments_summary.csv\u0026#39;) runs_metrics = pd.read_csv(path+\u0026#39;/experiments_metrics.csv\u0026#39;) runs = runs_metrics.merge(runs_config, left_on=\u0026#39;name\u0026#39;, right_on=\u0026#39;name\u0026#39;, suffixes=(\u0026#39;\u0026#39;,\u0026#39;_\u0026#39;)) runs.rename(columns={\u0026#39;_step\u0026#39;:\u0026#39;epoch\u0026#39;}, inplace=True) runs[\u0026#39;nobias\u0026#39;] = runs[\u0026#39;nobias\u0026#39;].astype(str) runs[\u0026#39;dropout\u0026#39;] = runs[\u0026#39;dropout\u0026#39;].astype(float) runs[\u0026#39;k\u0026#39;] = runs[\u0026#39;k\u0026#39;].astype(str) runs[\u0026#39;alpha\u0026#39;] = runs[\u0026#39;alpha\u0026#39;].astype(str) runs.query(\u0026#39;nhidden==\u0026#34;100\u0026#34; and _step_\u0026gt;=999.0\u0026#39;, inplace=True) query = \u0026#39;\u0026#39;\u0026#39;((onlylocal==True and dropout==0) or (onlylocal==False and (dropout==0.2 or dropout==0.5))) and ((model==\u0026#34;appnp\u0026#34; and 
k==\u0026#34;20.0\u0026#34; and alpha==\u0026#34;0.2\u0026#34; and (dropout==0. or dropout==0.2)) or (model==\u0026#34;mlp\u0026#34; and (dropout==0. or dropout==0.2)) or (model==\u0026#34;gcn\u0026#34; and (dropout==0. or dropout==0.5))) and bidirectional==True and weight_decay==0.0005 and nobias==\u0026#34;False\u0026#34;\u0026#39;\u0026#39;\u0026#39;.replace(\u0026#39;\\n\u0026#39;,\u0026#39; \u0026#39;) g1 = sns.relplot(\u0026#39;epoch\u0026#39;, \u0026#39;val_f1_score\u0026#39;, col=\u0026#39;onlylocal\u0026#39;, hue=\u0026#39;model\u0026#39;, style=\u0026#39;dropout\u0026#39;, palette=sns.color_palette(\u0026#34;Set1\u0026#34;, 3), kind=\u0026#39;line\u0026#39;, data=runs.query(query)); g1.axes.flat[0].axhline(0.8072, c=\u0026#39;k\u0026#39;, alpha=0.8, ls=\u0026#39;-.\u0026#39;, lw=1) g1.axes.flat[0].text(1,0.815,\u0026#39;RandomForest\u0026#39;) g1.axes.flat[1].axhline(0.7799, c=\u0026#39;k\u0026#39;, alpha=0.8, ls=\u0026#39;-.\u0026#39;, lw=1) g1.axes.flat[1].text(1,0.785,\u0026#39;RandomForest\u0026#39;) plt.suptitle(\u0026#39;Performance of the best models of each class using all features vs. only local features\u0026#39;, y=1.02); Introduction The Elliptic Data Set consists of anonymized transactions collected from the bitcoin exchange during 49 distinct time-periods. The transactions are represented as a graph containing 203769 nodes (transactions) and 234355 edges (bitcoin flow from one transaction to another). A subset of the transactions are labeled as licit or illicit. A detailed description of the dataset and an initial approach applying graph convolutional networks (GCNs) for the task of node classification has been addressed by:\nM. Weber, G. Domeniconi, J. Chen, D. K. I. Weidele, C. Bellei, T. Robinson, C. E. 
Leiserson, \u0026ldquo;Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics\u0026rdquo;, KDD ’19 Workshop on Anomaly Detection in Finance, August 2019, Anchorage, AK, USA.\nIn this notebook, I will take a closer look to how graph-based neural networks can be applied to this task and propose possible directions for future analyses.\nDue to the longer training times and for reproducibility, the experiments were all run using the script in models.py and all runs and metrics were tracked on Weights\u0026amp;Biases. All results were exported to a csv file using this script and loaded onto this notebook for visualization.\n#collapse-hide import os import random import time import dgl import networkx as nx import numpy as np import pandas as pd import torch import torch.nn as nn import torch.nn.functional as F from dgl.nn.pytorch import GraphConv from sklearn.metrics import confusion_matrix, precision_recall_fscore_support import matplotlib import matplotlib.pyplot as plt import seaborn as sns %matplotlib inline matplotlib.rcParams[\u0026#39;figure.dpi\u0026#39;] = 300 # set random seeds seed = 0 random.seed(seed) np.random.seed(seed) dgl.random.seed(seed) torch.manual_seed(seed); path = os.path.realpath(\u0026#39;.\u0026#39;) #collapse-hide # load experiment results exported from Weights\u0026amp;Biases runs_config = pd.read_csv(path+\u0026#39;/experiments_summary.csv\u0026#39;) runs_metrics = pd.read_csv(path+\u0026#39;/experiments_metrics.csv\u0026#39;) runs = runs_metrics.merge(runs_config, left_on=\u0026#39;name\u0026#39;, right_on=\u0026#39;name\u0026#39;, suffixes=(\u0026#39;\u0026#39;,\u0026#39;_\u0026#39;)) runs.rename(columns={\u0026#39;_step\u0026#39;:\u0026#39;epoch\u0026#39;}, inplace=True) runs[\u0026#39;nobias\u0026#39;] = runs[\u0026#39;nobias\u0026#39;].astype(str) runs[\u0026#39;dropout\u0026#39;] = runs[\u0026#39;dropout\u0026#39;].astype(float) runs[\u0026#39;k\u0026#39;] = 
runs[\u0026#39;k\u0026#39;].astype(str) runs[\u0026#39;alpha\u0026#39;] = runs[\u0026#39;alpha\u0026#39;].astype(str) runs.query(\u0026#39;nhidden==\u0026#34;100\u0026#34; and _step_\u0026gt;=999.0\u0026#39;, inplace=True) Transaction data Three tables are initially available to download from Kaggle\u0026rsquo;s dataset repository:\nAn edgelist: the edges between bitcoin transactions (nodes identified by transaction id) necessary to build the graph A classes table: label for each transaction can be licit, illicit, or unknown A features table with 167 columns Transaction id Timestep: consecutive periods of time for which all bitcoin flows are translated to edges in a graph Edges exist only between transactions within the same timestep 93 local features, i.e. intrinsic properties of the transactions themselves such as amount, transaction fee, etc. 72 aggregated features with information about the immediate neighborhood of each node, e.g. sum of amounts of the neighboring transactions # load data df_edges = pd.read_csv(path + \u0026#34;/elliptic_bitcoin_dataset/elliptic_txs_edgelist.csv\u0026#34;) df_classes = pd.read_csv(path + \u0026#34;/elliptic_bitcoin_dataset/elliptic_txs_classes.csv\u0026#34;) df_features = pd.read_csv( path + \u0026#34;/elliptic_bitcoin_dataset/elliptic_txs_features.csv\u0026#34;, header=None ) # rename the classes to ints that can be handled by pytorch as labels df_classes[\u0026#34;label\u0026#34;] = df_classes[\u0026#34;class\u0026#34;].replace( {\u0026#34;unknown\u0026#34;: -1, # unlabeled nodes \u0026#34;2\u0026#34;: 0, # labeled licit nodes #\u0026#34;1\u0026#34;: 1, # labeled illicit nodes } ).astype(int) # rename features according to data description in paper rename_dict = dict( zip( range(0, 167), [\u0026#34;txId\u0026#34;, \u0026#34;time_step\u0026#34;] + [f\u0026#34;local_{i:02d}\u0026#34; for i in range(1, 94)] + [f\u0026#34;aggr_{i:02d}\u0026#34; for i in range(1, 73)], ) ) df_features.rename(columns=rename_dict, inplace=True) # 
check missing data print(f\u0026#34;Number of missing data points: {df_features.isna().sum().sum()+df_classes.isna().sum().sum()}\u0026#34;) print(f\u0026#34;Number of nodes (transactions): {df_features[\u0026#39;txId\u0026#39;].nunique()}\u0026#34;) print(f\u0026#34;Number of edges: {df_edges.shape[0]}\u0026#34;) print(f\u0026#34;Number of classes: {df_classes[\u0026#39;class\u0026#39;].nunique()}\u0026#34;) print(f\u0026#34;Timesteps range from {df_features[\u0026#39;time_step\u0026#39;].min()} to {df_features[\u0026#39;time_step\u0026#39;].max()}\u0026#34;) Number of missing data points: 0 Number of nodes (transactions): 203769 Number of edges: 234355 Number of classes: 3 Timesteps range from 1 to 49 Correlation analysis Additionally, the dataset was analyzed using the handy pandas-profiling package. The complete script for the analysis is in eda.py, which generates a detailed report including multiple correlations. The main findings from the report can be summarized as:\n29 features are highly skewed 76 features are highly correlated to other features in the dataset (Spearman correlation coefficient $\\rho \u0026gt; 0.90$) 21 aggregated features are highly correlated to other aggregated features 54 local features are highly correlated to other local features time_step is highly correlated with aggr_43 ($\\rho = 0.91$) Constructing the transaction graph We now have our data prepared in table format, but we want to be able to work on the graph constructed from the data. In order to create our transaction graph, we use the networkx package. 
We create a directed multigraph (a directed graph that allows for multiple edges between two nodes) and add the label attribute to each transaction.

```python
# create networkx graph from the pandas dataframes
g_nx = nx.MultiDiGraph()
g_nx.add_nodes_from(
    zip(df_classes["txId"], [{"label": v} for v in df_classes["label"]])
)
g_nx.add_edges_from(zip(df_edges["txId1"], df_edges["txId2"]))

print(f"Graph with {g_nx.number_of_nodes()} nodes and {g_nx.number_of_edges()} edges.")
print(f"Number of connected components: {len(list(nx.weakly_connected_components(g_nx)))}")
```

```
Graph with 203769 nodes and 234355 edges.
Number of connected components: 49
```

We can confirm that there are 49 connected components (weakly connected components, in the case of directed graphs), one for each timestep. This means that the dataset consists of 49 different subgraphs, each corresponding to one timestep.

```python
# create list of graphs, one for each timestep
components = list(nx.weakly_connected_components(g_nx))
g_nx_t_list = [g_nx.subgraph(components[i]) for i in range(0, len(components))]

with sns.axes_style('white'):
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))
    for i, t in enumerate([26, 48]):
        node_label = list(nx.get_node_attributes(g_nx_t_list[t], 'label').values())
        mapping = {-1: 'grey', 0: 'C0', 1: 'C3'}
        node_color = [mapping[l] for l in node_label]
        nx.draw_networkx(g_nx_t_list[t], node_size=10, node_color=node_color,
                         with_labels=False, width=0.2, alpha=0.8, arrowsize=8, ax=ax[i])
    leg = ax[0].legend(['unlabeled', 'licit', 'illicit'])
    leg.legendHandles[0].set_color('grey')
    leg.legendHandles[1].set_color('C0')
    leg.legendHandles[2].set_color('C3')
    plt.tight_layout()
```
We can see that most of the transactions are not labeled and that only a minority of the labeled nodes correspond to illicit transactions. Moreover, the graph does not seem to be particularly dense. There seem to be chains of transactions, one after the other. Also, many of these chains seem to concentrate one type of labeled transaction: either licit or illicit. Finally, a minority of nodes seems to have a larger number of edges, connecting with transactions in multiple chains or further away from their immediate neighborhood.

## Graph metrics

We can calculate selected graph metrics to better quantify some important structural properties of the transaction graph.

```python
g_metrics = {}
g_metrics['timestep'] = np.arange(1, 50)
g_metrics['number_of_nodes'] = [graph.number_of_nodes() for graph in g_nx_t_list]
g_metrics['avg_degree'] = [np.mean(list(dict(nx.degree(graph)).values())) for graph in g_nx_t_list]
g_metrics['density'] = [nx.density(graph) for graph in g_nx_t_list]
g_metrics['avg_clustering'] = [nx.average_clustering(nx.DiGraph(graph)) for graph in g_nx_t_list]
g_metrics['avg_shortest_path'] = [nx.average_shortest_path_length(nx.DiGraph(graph)) for graph in g_nx_t_list]

fig, ax = plt.subplots(len(g_metrics) - 1, 1, figsize=(10, 6), sharex=True)
for i, label in enumerate(list(g_metrics.keys())[1:]):
    ax[i].bar(g_metrics['timestep'], g_metrics[label], label=label)
    ax[i].legend()
plt.xlabel('timestep')

print(f"Average density of the graphs across all timesteps: {np.mean(g_metrics['density']):.6f}")
print(f"Average degree of all nodes across all timesteps: {np.mean(list(dict(nx.degree(g_nx)).values())):.2f}")
```

```
Average density of the graphs across all timesteps: 0.000318
Average degree of all nodes across all timesteps: 2.30
```

Judging by its density, the transaction graph is rather sparse. The average density across all timesteps lies around 0.0003177 and each node has an average of 2.30 edges. In comparison, the Cora dataset (popularly used as a benchmark for node classification algorithms) has an average degree of 3.90 and a density of 0.0014812. With only 2708 nodes, Cora is a much smaller and denser graph.

## Training a Graph Convolutional Network

To build and train the GCN, I used DGL as a framework for deep learning on graphs. DGL is based on pytorch and uses DGLGraph objects that can be easily created from networkx graphs. Moreover, several implementations of graph-based neural layers are available in DGL.

### Graph creation

First we create the DGLGraph from a networkx graph. We also add the label information as a tensor to the node data in the DGLGraph (from now on simply "graph"). Similarly, we add the node feature matrix to the graph as a tensor of shape (number of nodes, number of features) = (203769, 166).

Importantly, I tested the performance of the GCN using two options for constructing the graph:

- Unidirectional edges: edges going from one transaction to the next. In a GCN, this means that the information used by the model to classify nodes flows in one direction (downstream) only. In this case, we use g_nx directly.
- Bidirectional edges: edges are made bidirectional, which allows the information flow in a GCN to travel in both directions (downstream and upstream).
For the bidirectional variant, we use `g_nx.to_undirected().to_directed()`: we first make the edges undirected, and then directed again. By doing so, networkx makes the edges in the resulting graph bidirectional. It is not explicitly stated in the paper which of these two options was used by Weber et al. However, judging from the performance metrics, it is likely that they used the bidirectional version.

```python
# create unidirectional graph
g = dgl.DGLGraph()
g.from_networkx(g_nx)
g.ndata["label"] = torch.tensor(
    df_classes.set_index("txId").loc[sorted(g_nx.nodes()), "label"].values
)
g.ndata["features_matrix"] = torch.tensor(
    df_features.set_index("txId").loc[sorted(g_nx.nodes()), :].values
)
print(g)
```

```
DGLGraph(num_nodes=203769, num_edges=234355,
         ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'features_matrix': Scheme(shape=(166,), dtype=torch.float64)}
         edata_schemes={})
```

```python
# make unidirectional edges bidirectional in networkx
g_nx_bidirectional = g_nx.to_undirected().to_directed()

# create bidirectional graph
g_bi = dgl.DGLGraph()
g_bi.from_networkx(g_nx_bidirectional)
g_bi.ndata["label"] = torch.tensor(
    df_classes.set_index("txId").loc[sorted(g_nx.nodes()), "label"].values
)
g_bi.ndata["features_matrix"] = torch.tensor(
    df_features.set_index("txId").loc[sorted(g_nx.nodes()), :].values
)
print(g_bi)
```

```
DGLGraph(num_nodes=203769, num_edges=468710,
         ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'features_matrix': Scheme(shape=(166,), dtype=torch.float64)}
         edata_schemes={})
```

### Graph normalization

The performance of the GCN benefits from using a normalized version of the adjacency matrix. I applied a common normalization approach used in the literature.
It starts by adding a self-loop to each node, which is equivalent to adding the identity matrix to the adjacency matrix $A$:

$$\tilde{A} = A + I$$

The rest of the normalization consists of symmetrically normalizing $\tilde{A}$ by the node degrees (this is closely related to the normalized graph Laplacian; there are good overviews available explaining what the Laplacian means):

$$\hat{A} = D^{-1/2}\tilde{A}D^{-1/2}$$

where $D$ is a diagonal matrix whose diagonal contains the degree of each node of $\tilde{A}$.

This matrix notation effectively means that $\hat{A}$ can have as a value in each $(i,j)$ position:

$$\hat{A}(i,j) = \begin{cases} {1 \over \deg(i)} & \text{if $i=j$} \\[2ex] {1 \over {\sqrt{\deg(i)\deg(j)}}} & \text{if $i \neq j$ and $(i,j) \in E$},\\[2ex] 0 & \text{otherwise} \end{cases}$$

where $E$ is the set of edges of the graph and the degrees include the self-loops. As we can see, nodes with an (in-)degree of zero would be troublesome, which is why we add the self-loops.

```python
# add self loop
g.add_edges(g.nodes(), g.nodes())
print(g)

# add self loop to the bidirectional edges graph
g_bi.add_edges(g_bi.nodes(), g_bi.nodes())
print(g_bi)
```

```
DGLGraph(num_nodes=203769, num_edges=438124,
         ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'features_matrix': Scheme(shape=(166,), dtype=torch.float64)}
         edata_schemes={})
DGLGraph(num_nodes=203769, num_edges=672479,
         ndata_schemes={'label': Scheme(shape=(), dtype=torch.int64), 'features_matrix': Scheme(shape=(166,), dtype=torch.float64)}
         edata_schemes={})
```

### Train/validation data splitting

Now that our graph is ready, we need to split our data into training and validation sets. We will follow the same approach taken by Weber et al., which consists of a time-based split using the initial 70% of the timesteps to train the model and the remaining 30% for validation. This temporal split makes sense in the context of possible applications of such a model.
A company could continuously use their older historical data and labels for training and then predict the node classes for more recent transactions.

```python
features = g.ndata["features_matrix"].float()
labels = g.ndata["label"].long()  # format required for cross entropy loss

in_feats = features.shape[1]
n_classes = 2  # licit or illicit (unknown label is ignored)
n_edges = g.number_of_edges()

train_ratio = 0.7
n_time_steps = len(np.unique(features[:, 0]))
train_time_steps = round(n_time_steps * train_ratio)

train_indices = (
    ((features[:, 0] <= train_time_steps) & (labels != -1)).nonzero().view(-1)
)
val_indices = (
    ((features[:, 0] > train_time_steps) & (labels != -1)).nonzero().view(-1)
)

print(f"""Number of timesteps used for training: {train_time_steps}
Number of timesteps used for validation: {n_time_steps - train_time_steps}""")
```

```
Number of timesteps used for training: 34
Number of timesteps used for validation: 15
```

### GCN model architecture

For the GCN model, I used the implementation from DGL, which is based on the original implementation by Kipf et al. (2016). In short, the algorithm is given by the formula

$$H^{(l+1)} = \sigma (D^{-1/2}\tilde{A}D^{-1/2}H^{(l)}W^{(l)})$$

where $\sigma$ is the activation function (ReLU), $D^{-1/2}\tilde{A}D^{-1/2}$ is the normalized adjacency matrix from above, $H^{(l)}$ are the node representations and $W^{(l)}$ are the learnable weights of the $l$-th layer of the neural network. It is also possible to add a learnable bias to each layer.
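To make the propagation rule concrete, here is a minimal numpy sketch of a single graph-convolution step on a toy 4-node graph (toy adjacency and random weights for illustration only, not the DGL implementation):

```python
import numpy as np

# toy directed graph with 4 nodes: adjacency matrix A
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)

A_tilde = A + np.eye(4)                    # add self-loops
deg = A_tilde.sum(axis=1)                  # degree of each node (incl. self-loop)
D_inv_sqrt = np.diag(deg ** -0.5)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # normalized adjacency

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))                # node features, 8 dimensions
W = rng.normal(size=(8, 3))                # learnable layer weights

H_next = np.maximum(0, A_hat @ H @ W)      # one layer: sigma = ReLU
print(H_next.shape)  # (4, 3)
```

Each node's new representation mixes its own features with those of its neighbors, weighted by $\hat A$; stacking two such layers (as below) lets information travel two hops.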
There are some very good explanations of how this algorithm works available for reference.

We can now define the Graph Convolutional Network architecture using DGL. It consists of `n_layers` GraphConv layers with `n_hidden` hidden units per layer:

```python
class GCN(nn.Module):
    def __init__(self, g, in_feats, n_hidden, n_classes, n_layers, activation, dropout, bias):
        super(GCN, self).__init__()
        self.g = g
        self.layers = nn.ModuleList()
        # input layer
        self.layers.append(GraphConv(in_feats, n_hidden, activation=activation, bias=bias))
        # hidden layers
        for _ in range(n_layers - 2):
            self.layers.append(GraphConv(n_hidden, n_hidden, activation=activation, bias=bias))
        # output layer
        self.layers.append(GraphConv(n_hidden, n_classes, bias=bias))
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, features):
        h = features
        for i, layer in enumerate(self.layers):
            if i != 0:
                h = self.dropout(h)
            h = layer(self.g, h)
        return h


# utility function to evaluate the model
def evaluate(model, loss_fcn, features, labels, mask):
    """Calculate the loss, accuracy, precision, recall and f1_score for the masked data"""
    model.eval()
    with torch.no_grad():
        logits = model(features)
        logits = logits[mask]
        labels = labels[mask]
        loss = loss_fcn(logits, labels)
        _, indices = torch.max(logits, dim=1)
        correct = torch.sum(indices == labels)
        p, r, f, _ = precision_recall_fscore_support(labels, indices)
        return loss, correct.item() * 1.0 / len(labels), p[1], r[1], f[1]


# utility function to obtain a confusion matrix
def eval_confusion_matrix(model, features, labels, mask):
    model.eval()
    with torch.no_grad():
        logits = model(features)
        logits = logits[mask]
        labels = labels[mask]
        _, indices = torch.max(logits, dim=1)
        print(confusion_matrix(labels, indices))
```

### Model training

Now we can train the model using the specifications from the paper by Weber et al.:

- cross entropy loss function putting a higher weight on the positive (illicit) samples: 0.7 positive vs 0.3 negative
- Adam optimizer with learning rate 1e-3
- no weight decay is mentioned
- no bias is mentioned
- no dropout
- train for 1000 epochs
- two Graph Convolutional layers with 100 and 2 neurons, respectively

```python
# train and evaluate the model
def train_eval_model(model_class, g, features, **params):
    in_feats = features.shape[1]
    n_classes = 2
    n_hidden = params["n_hidden"]
    n_layers = params["n_layers"]
    weight_decay = params["weight_decay"]
    bias = params["bias"]
    dropout = params["dropout"]
    epochs = params["epochs"]
    lr = params["lr"]
    posweight = params["posweight"]

    model = model_class(g, in_feats, n_hidden, n_classes, n_layers, F.relu, dropout, bias)

    # weighted cross entropy loss function
    loss_fcn = torch.nn.CrossEntropyLoss(
        weight=torch.tensor([1 - posweight, posweight])
    )

    # use optimizer
    optimizer = torch.optim.Adam(
        model.parameters(), lr=lr, weight_decay=weight_decay
    )

    dur = []
    metrics = {
        "loss": {"train": [], "val": []},
        "accuracy": {"train": [], "val": []},
        "precision": {"train": [], "val": []},
        "recall": {"train": [], "val": []},
        "f1_score": {"train": [], "val": []},
    }
    for epoch in range(epochs):
        model.train()
        if epoch >= 3:
            t0 = time.time()

        # forward pass
        logits = model(features)
        loss = loss_fcn(logits[train_indices], labels[train_indices])
        metrics["loss"]["train"].append(loss)

        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # duration
        if epoch >= 3:
            dur.append(time.time() - t0)

        # evaluate on training set
        _, train_acc, train_precision, train_recall, train_f1_score = evaluate(
            model, loss_fcn, features, labels, train_indices
        )
        metrics["accuracy"]["train"].append(train_acc)
        metrics["precision"]["train"].append(train_precision)
        metrics["recall"]["train"].append(train_recall)
        metrics["f1_score"]["train"].append(train_f1_score)

        # evaluate on validation set
        val_loss, val_acc, val_precision, val_recall, val_f1_score = evaluate(
            model, loss_fcn, features, labels, val_indices
        )
        metrics["loss"]["val"].append(val_loss)
        metrics["accuracy"]["val"].append(val_acc)
        metrics["precision"]["val"].append(val_precision)
        metrics["recall"]["val"].append(val_recall)
        metrics["f1_score"]["val"].append(val_f1_score)

        if (epoch + 1) % 100 == 0:
            print(
                f"Epoch {epoch:05d} | Time(s) {np.mean(dur):.2f} | val_loss {val_loss.item():.4f} "
                f"| Precision {val_precision:.4f} | Recall {val_recall:.4f} | Acc {val_acc:.4f} "
                f"| F1_score {val_f1_score:.4f}"
            )

    print("Confusion matrix:")
    eval_confusion_matrix(model, features, labels, val_indices)
    return model, metrics


# GCN model parameters
params = {
    "n_hidden": 100,
    "n_layers": 2,
    "weight_decay": 0.,
    "bias": False,
    "dropout": 0.,
    "epochs": 1000,
    "lr": 1e-3,
    "posweight": 0.7,
}

# train on graph with unidirectional edges
model, metrics = train_eval_model(GCN, g, features, **params)
```

```
Epoch 00099 | Time(s) 0.60 | val_loss 0.3580 | Precision 0.2701 | Recall 0.5125 | Acc 0.8783 | F1_score 0.3537
Epoch 00199 | Time(s) 0.60 | val_loss 0.3381 | Precision 0.3649 | Recall 0.4589 | Acc 0.9130 | F1_score 0.4065
Epoch 00299 | Time(s) 0.59 | val_loss 0.3537 | Precision 0.4101 | Recall 0.4358 | Acc 0.9226 | F1_score 0.4226
Epoch 00399 | Time(s) 0.58 | val_loss 0.4015 | Precision 0.4039 | Recall 0.4192 | Acc 0.9221 | F1_score 0.4114
Epoch 00499 | Time(s) 0.58 | val_loss 0.4556 | Precision 0.3895 | Recall 0.3989 | Acc 0.9203 | F1_score 0.3942
Epoch 00599 | Time(s) 0.58 | val_loss 0.5079 | Precision 0.3622 | Recall 0.3860 | Acc 0.9160 | F1_score 0.3737
Epoch 00699 | Time(s) 0.59 | val_loss 0.5570 | Precision 0.3533 | Recall 0.3813 | Acc 0.9145 | F1_score 0.3668
Epoch 00799 | Time(s) 0.59 | val_loss 0.6020 | Precision 0.3702 | Recall 0.3860 | Acc 0.9175 | F1_score 0.3779
Epoch 00899 | Time(s) 0.60 | val_loss 0.6472 | Precision 0.3677 | Recall 0.3915 | Acc 0.9167 | F1_score 0.3792
Epoch 00999 | Time(s) 0.60 | val_loss 0.7006 | Precision 0.3618 | Recall 0.3869 | Acc 0.9158 | F1_score 0.3739
Confusion matrix:
[[14848   739]
 [  664   419]]
```

```python
# train on graph with bidirectional edges
model_bi, metrics_bi = train_eval_model(GCN, g_bi, features, **params)
```

```
Epoch 00099 | Time(s) 0.62 | val_loss 0.3113 | Precision 0.3251 | Recall 0.4765 | Acc 0.9017 | F1_score 0.3865
Epoch 00199 | Time(s) 0.62 | val_loss 0.3229 | Precision 0.4209 | Recall 0.4543 | Acc 0.9239 | F1_score 0.4369
Epoch 00299 | Time(s) 0.62 | val_loss 0.3302 | Precision 0.4776 | Recall 0.4331 | Acc 0.9324 | F1_score 0.4542
Epoch 00399 | Time(s) 0.63 | val_loss 0.3456 | Precision 0.5940 | Recall 0.4054 | Acc 0.9434 | F1_score 0.4819
Epoch 00499 | Time(s) 0.63 | val_loss 0.3675 | Precision 0.7227 | Recall 0.3850 | Acc 0.9504 | F1_score 0.5024
Epoch 00599 | Time(s) 0.63 | val_loss 0.3933 | Precision 0.7757 | Recall 0.3832 | Acc 0.9527 | F1_score 0.5130
Epoch 00699 | Time(s) 0.63 | val_loss 0.4211 | Precision 0.7784 | Recall 0.3795 | Acc 0.9527 | F1_score 0.5102
Epoch 00799 | Time(s) 0.63 | val_loss 0.4465 | Precision 0.7778 | Recall 0.3813 | Acc 0.9527 | F1_score 0.5118
Epoch 00899 | Time(s) 0.63 | val_loss 0.4769 | Precision 0.7874 | Recall 0.3795 | Acc 0.9530 | F1_score 0.5121
Epoch 00999 | Time(s) 0.63 | val_loss 0.5053 | Precision 0.7684 | Recall 0.3767 | Acc 0.9521 | F1_score 0.5056
Confusion matrix:
[[15464   123]
 [  675   408]]
```

```python
# plot the metrics during training
fig, ax = plt.subplots(1, 3, figsize=(18, 5), sharex=True)
ax[0].plot(metrics['loss']['train'], label='unidir. train_loss', color='C0')
ax[0].plot(metrics['loss']['val'], label='unidir. val_loss', color='C0', ls=':')
ax[1].plot(metrics['f1_score']['val'], label='unidir. val_f1_score', color='C0')
ax[2].plot(metrics['precision']['val'], label='unidir. val_precision', color='C0')
ax[2].plot(metrics['recall']['val'], label='unidir. val_recall', color='C0', ls=':')
ax[0].plot(metrics_bi['loss']['train'], label='bidir. train_loss', color='C3')
ax[0].plot(metrics_bi['loss']['val'], label='bidir. val_loss', color='C3', ls=':')
ax[1].plot(metrics_bi['f1_score']['val'], label='bidir. val_f1_score', color='C3')
ax[2].plot(metrics_bi['precision']['val'], label='bidir. val_precision', color='C3')
ax[2].plot(metrics_bi['recall']['val'], label='bidir. val_recall', color='C3', ls=':')
ax[0].legend()
ax[1].legend()
ax[2].legend()
```

Training the GCN model with these parameters led to poorer performance than reported in the paper by Weber et al. The bidirectional variant produced the better results, so it is probably also the setting used in the paper. Even though the flow of bitcoins from one transaction to another is intuitively unidirectional, keeping the edges unidirectional means that, in a GCN, information would only flow downstream. Making the edges bidirectional lets each node receive information both from upstream and from downstream, which greatly improved the performance of the GCN.

| Model | Edges | Dropout | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| GCN (Weber et al.) | – | 0. | 0.812 | 0.512 | 0.628 |
| GCN | unidirectional | 0. | 0.3764 | 0.3823 | 0.3793 |
| GCN | bidirectional | 0. | 0.7860 | 0.3832 | 0.5152 |

To figure out whether there were additional parameters that I had not yet considered in replicating the GCN approach, I performed a series of experiments varying further training parameters and comparing the results. This, in turn, was useful in understanding in what ways the model could be modified to increase its performance.

## Additional experiments with GCNs

### Address overfitting with weight decay (L2 regularization)

One observation from the previous learning curves is that the validation loss starts to increase again after ca. 400 epochs, a clear sign of overfitting. One way to address this is to regularize the model: in this case, I added weight decay (L2 regularization) to improve training.

I further added a learnable bias to the GCN to see if this improved its performance (it did slightly).
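For reference, the weight decay passed to the optimizer is just an L2 penalty folded into the gradient. A minimal numpy sketch of the idea (illustrative only, not the torch.optim internals):

```python
import numpy as np

def l2_regularized_grad(grad, w, weight_decay=5e-4):
    # weight decay adds lambda * w to the raw gradient, equivalent to
    # adding an L2 penalty of 0.5 * lambda * ||w||^2 to the loss,
    # which shrinks large weights and discourages overfitting
    return grad + weight_decay * w

w = np.array([2.0, -1.0])
raw_grad = np.array([0.1, 0.1])
print(l2_regularized_grad(raw_grad, w))  # grad becomes [0.101, 0.0995]
```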
The other parameters were left intact.

```python
query = 'model=="gcn" and onlylocal==False and dropout=="0"'
print(f"Plotting {runs.query(query)['name'].nunique()} runs: {runs.query(query)['name'].unique()}")

g = sns.relplot('epoch', 'train_loss', col='weight_decay', hue='bidirectional', style='nobias',
                palette=sns.color_palette("Set1", 2), kind='line', data=runs.query(query))
plt.ylim(0, 1)

g = sns.relplot('epoch', 'val_loss', col='weight_decay', hue='bidirectional', style='nobias',
                palette=sns.color_palette("Set1", 2), kind='line', data=runs.query(query))
plt.ylim(0, 1)
```

```
Plotting 8 runs: ['super-sweep-14' 'blooming-sweep-13' 'vague-sweep-10' 'sandy-sweep-9'
 'eternal-sweep-6' 'resilient-sweep-5' 'ethereal-sweep-2' 'vibrant-sweep-1']
```

```python
query = 'model=="gcn" and onlylocal==False and dropout==0'
print(f"Plotting {runs.query(query)['name'].nunique()} runs: {runs.query(query)['name'].unique()}")

sns.relplot('epoch', 'val_f1_score', col='weight_decay', hue='bidirectional', style='nobias',
            palette=sns.color_palette("Set1", 2), kind='line', data=runs.query(query))
```

```
Plotting 8 runs: ['super-sweep-14' 'blooming-sweep-13' 'vague-sweep-10' 'sandy-sweep-9'
 'eternal-sweep-6' 'resilient-sweep-5' 'ethereal-sweep-2' 'vibrant-sweep-1']
```

### Address overfitting by adding dropout

Another way to address overfitting is to add dropout to the model. In this case, I added dropout before the second GCN layer (meaning that the inputs to the 2nd layer are dropped out with a certain probability). Adding dropout considerably increased the precision, meaning that the model predicts far fewer false positives. The recall, on the other hand, remains largely unchanged. The best model was obtained with a dropout of $p = 0.5$.

```python
query = ('model=="gcn" and bidirectional==True and onlylocal==False and nobias=="False" '
         'and weight_decay=="0.0005" and (dropout==0. or dropout==0.25 or dropout=="0.5")')
print(f"Plotting {runs.query(query)['name'].nunique()} runs: {runs.query(query)['name'].unique()}")

fig, ax = plt.subplots(1, 2, figsize=(12, 5))
g1 = sns.relplot('epoch', 'val_loss', col='bidirectional', hue='dropout', kind='line',
                 palette=sns.color_palette("Set1", 3), data=runs.query(query), ax=ax[0])
g2 = sns.relplot('epoch', 'train_loss', col='bidirectional', hue='dropout', kind='line',
                 palette=sns.color_palette("Set1", 3), data=runs.query(query), ax=ax[1])
plt.close(g1.fig)
plt.close(g2.fig)
plt.legend(title='dropout', labels=['0.0', '0.25', '0.5'])
plt.setp(ax, ylim=(0, 0.8))
```

```
Plotting 3 runs: ['faithful-sweep-3' 'quiet-deluge-19' 'sandy-sweep-9']
```

```python
query = ('model=="gcn" and bidirectional==True and onlylocal==False and nobias=="False" '
         'and weight_decay=="0.0005" and (dropout==0. or dropout==0.25 or dropout=="0.5")')
print(f"Plotting {runs.query(query)['name'].nunique()} runs: {runs.query(query)['name'].unique()}")

fig, ax = plt.subplots(1, 3, figsize=(18, 5))
g1 = sns.relplot('epoch', 'val_f1_score', col='bidirectional', hue='dropout', kind='line',
                 palette=sns.color_palette("Set1", 3), data=runs.query(query), ax=ax[0])
g2 = sns.relplot('epoch', 'val_precision', col='bidirectional', hue='dropout', kind='line',
                 palette=sns.color_palette("Set1", 3), data=runs.query(query), ax=ax[1])
g3 = sns.relplot('epoch', 'val_recall', col='bidirectional', hue='dropout', kind='line',
                 palette=sns.color_palette("Set1", 3), data=runs.query(query), ax=ax[2])
plt.close(g1.fig)
plt.close(g2.fig)
plt.close(g3.fig)
plt.legend(title='dropout', labels=['0.0', '0.25', '0.5'])
```

```
Plotting 3 runs: ['faithful-sweep-3' 'quiet-deluge-19' 'sandy-sweep-9']
```

### Training a GCN with local features only

A question that arises from the paper is: how much does the graph-based information contribute to the performance of a GCN model compared to a more traditional non-graph-based approach? Weber et al. show that the node embeddings that can be extracted from a GCN can help boost other traditional models.
This makes sense intuitively: because of the networked nature of bitcoin transactions, knowing the context or "neighborhood" of a transaction should add important information.

However, from the description of the Elliptic dataset we know that some of the features already contain information about the context of the transactions. In fact, 72 out of the 166 features are aggregated features. Therefore, I was curious to find out how a GCN model would perform with only the remaining 94 local features (including timestep) as inputs.

I modified the model to limit the set of features to the local ones only (including timestep). The input node features thus have a shape of (94,). This way, we can assess the performance of a GCN without having to manually engineer features from the neighbors of each node. In other words, we leave the feature engineering to the neural network itself.

```python
# consider only the first 94 features of the node feature matrix
features_local = g_bi.ndata["features_matrix"][:, 0:94].float()

print(f"Number of features (all): {features.shape[1]}")
print(f"Number of features (only local): {features_local.shape[1]}")
```

```
Number of features (all): 166
Number of features (only local): 94
```

```python
# GCN model parameters
#params = {
#    "bidirectional": True,
#    "n_hidden": 100,
#    "n_layers": 2,
#    "weight_decay": 5e-4,
#    "bias": True,
#    "dropout": 0.25,
#    "epochs": 1000,
#    "lr": 1e-3,
#    "posweight": 0.7,
#}
#
#model, metrics = train_eval_model(GCN, g_bi, features_local, **params)

query = ('model=="gcn" and bidirectional==True and weight_decay==0.0005 and nobias=="False" '
         'and (dropout==0 or dropout==0.25 or dropout==0.5)')
print(f"Plotting {runs.query(query)['name'].nunique()} runs: {runs.query(query)['name'].unique()}")

sns.relplot('epoch', 'val_f1_score', col='onlylocal', hue='dropout',
            palette=sns.color_palette("Set1", 3), kind='line', data=runs.query(query))
```

```
Plotting 6 runs: ['divine-sweep-4' 'faithful-sweep-3' 'zesty-microwave-76' 'quiet-deluge-19'
 'divine-sweep-11' 'sandy-sweep-9']
```

We can see that the performance of the GCN improved with the addition of dropout when all features were considered, but not when using only local features. This makes sense, as the aggregated features are likely less essential than the local ones.

## Random Forest benchmark

A sobering additional finding from the paper by Weber et al. was the excellent out-of-the-box performance of a simple Random Forest in correctly classifying the transactions as licit or illicit.
I was able to replicate these results too, for two different sets of features:

- All features
- Only local features

```python
from sklearn.ensemble import RandomForestClassifier

# function to evaluate the model
def evaluate_rfc(model, features, labels, mask):
    """Calculate the precision, recall and f1_score for the masked data"""
    pred_rfc = model.predict(features[mask])
    labels = labels[mask]
    p, r, f, _ = precision_recall_fscore_support(labels, pred_rfc)
    return p[1], r[1], f[1]

# confusion matrix
def eval_confusion_matrix_rfc(model, features, labels, mask):
    pred_rfc = model.predict(features[mask])
    labels = labels[mask]
    print(confusion_matrix(labels, pred_rfc))
```

### Using all features (local + aggregated)

```python
rfc = RandomForestClassifier(n_estimators=50, max_features=50, random_state=seed)
rfc.fit(features[train_indices], labels[train_indices])

p, r, f1 = evaluate_rfc(rfc, features, labels, val_indices)
print(f"Precision {p:.4f} | Recall {r:.4f} | F1 score {f1:.4f}")
print("Confusion matrix:")
eval_confusion_matrix_rfc(rfc, features, labels, val_indices)
```

```
Precision 0.9167 | Recall 0.7211 | F1 score 0.8072
Confusion matrix:
[[15516    71]
 [  302   781]]
```

The best results were obtained when using all available features as input (local, including timestep, plus aggregated). Both the precision and recall of this model were high, which is confirmed by the confusion matrix: the model predicts nearly no false positives, and less than 30% of the illicit transactions are falsely labeled as negatives.
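As a sanity check, these metrics follow directly from the confusion matrix (positive class = illicit):

```python
# recompute precision/recall/F1 from the confusion matrix above
# sklearn layout: [[TN, FP], [FN, TP]], with illicit as the positive class
tn, fp, fn, tp = 15516, 71, 302, 781

precision = tp / (tp + fp)  # 781 / 852
recall = tp / (tp + fn)     # 781 / 1083
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision {precision:.4f} | Recall {recall:.4f} | F1 score {f1:.4f}")
# Precision 0.9167 | Recall 0.7211 | F1 score 0.8072
```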
This is very good performance for such a simple model.\nUsing local features only rfc_local = RandomForestClassifier(n_estimators=50, max_features=50, random_state=seed) rfc_local.fit(features_local[train_indices], labels[train_indices]) p, r, f1 = evaluate_rfc(rfc_local, features_local, labels, val_indices) print( f\u0026#34;Precision {p:.4f} | Recall {r:.4f} | \u0026#34; f\u0026#34;F1 score {f1:.4f}\u0026#34; ) eval_confusion_matrix_rfc(rfc_local, features_local, labels, val_indices) Precision 0.8749 | Recall 0.7036 | F1 score 0.7799 [[15478 109] [ 321 762]] Removing the aggregated features from the input to the Random Forest (leaving all other parameters equal) impairs its performance in both precision and recall, but not dramatically. It is still a very good model working out of the box. It makes sense that not having information about the immediate neighbors of a transaction would produce a worse-performing model.\nComparison to GCN So what do the results of the Random Forest tell us about GCNs and other deep learning techniques? Should we dismiss them and focus on simpler models instead? While this analysis shows that it pays to start simple and see how well we can tackle a task using classic machine learning methods first, there are still valid reasons to consider (graph) neural networks too.\nDeep learning on graphs is cool! Now, seriously:\nThere is information contained in the connections between data points, which a classical machine learning approach does not consider unless carefully crafted features are available, which requires time and domain-specific knowledge\nThe features available in a different dataset may be less informative than those in the Elliptic dataset, and therefore insufficient to produce a good-enough Random Forest model\nThere may even be situations when no intrinsic node features are available and we still want to be able to classify transactions.
This would still be possible using a GCN but not with a Random Forest\nProgress in unleashing the potential of GCNs can only be made by researching these networks\nNow let\u0026rsquo;s take a look at a different kind of graph-based model that, I figured, might be a good option for the bitcoin transaction classification task.\nLong-range propagation of label predictions using APPNP Let\u0026rsquo;s recapitulate.\nWe trained a complex GCN model to classify bitcoin transactions as licit or illicit and improved its performance through better parameters and training\nWe found that a Random Forest model performed better than our complex GCN, almost effortlessly\nWhy is this? To look for the answer, it pays to take a closer look at how the transaction graph is structured. Let\u0026rsquo;s take, for example, the transactions of the last timestep.\nwith sns.axes_style(\u0026#39;white\u0026#39;): plt.figure(figsize=(8,6)) node_label = list(nx.get_node_attributes(g_nx_t_list[49-1], \u0026#39;label\u0026#39;).values()) mapping = {-1:\u0026#39;grey\u0026#39;, 0:\u0026#39;C0\u0026#39;, 1:\u0026#39;C3\u0026#39;} node_color = [mapping[l] for l in node_label] nx.draw_networkx(g_nx_t_list[49-1], node_size=10, node_color=node_color, with_labels=False, width=0.2, alpha=0.8, arrowsize=8) leg = plt.legend([\u0026#39;unlabeled\u0026#39;, \u0026#39;licit\u0026#39;, \u0026#39;illicit\u0026#39;]) leg.legendHandles[0].set_color(\u0026#39;grey\u0026#39;) leg.legendHandles[1].set_color(\u0026#39;C0\u0026#39;) leg.legendHandles[2].set_color(\u0026#39;C3\u0026#39;) plt.show()
The original paper describing the GCN applied it to the node classification of papers in the Cora dataset. The bitcoin transaction graph is much larger and less dense than the Cora graph. One can hypothesize that in a denser graph with a higher average degree, each node receives more information from its neighbors, which would help a GCN make a better prediction.\nGraph Nodes Density Average degree\nElliptic (bitcoin transactions) 203769 0.0003177 2.30\nCora (citations) 2708 0.0014812 3.90\nFurthermore, if we consider a simple chain of transactions (like the ones seen in the graph visualization), a node would receive information only from the previous node and pass it on only to the following node in the chain. In such a situation, it could be that the range of neighbors that feed the classification of any given node is too short and often not sufficient for a correct prediction. If this is the case, then considering a longer-ranging neighborhood could help train a better classification model.\nApproximated Personalized Propagation of Neural Predictions (APPNP) Enter APPNP. This model was recently proposed as a way to reconcile the best of two worlds: neural message passing algorithms (in principle like the GCN), and personalized pagerank.\nIn PageRank, a node\u0026rsquo;s centrality or importance is calculated as a function of its connections and the importance of its neighbors. The pagerank $PR$ of a node $u$ is:\n$$PR(u) = (1-d){1 \\over {N}} + d \\sum_{v \\in \\mathcal{N} (u)} {PR(v) \\over D_{out}(v)}$$where $N$ is the total number of nodes, $D_{out}$ is the outdegree, and the sum runs over the set $\\mathcal{N}(u)$ of nodes $v$ that link to $u$.
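The iteration behind this formula can be sketched in a few lines of plain Python. This is a toy illustration on a made-up four-edge graph, not the implementation any graph library uses:

```python
# Toy power iteration for the PageRank formula above.
# Graph and damping factor are made-up illustration values.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
nodes = sorted({n for edge in edges for n in edge})
N = len(nodes)
d = 0.85  # damping factor

out_deg = {u: sum(1 for s, _ in edges if s == u) for u in nodes}
in_nbrs = {u: [s for s, t in edges if t == u] for u in nodes}

# start from a uniform distribution and iterate until converged
pr = {u: 1.0 / N for u in nodes}
for _ in range(100):
    pr = {u: (1 - d) / N + d * sum(pr[v] / out_deg[v] for v in in_nbrs[u])
          for u in nodes}

print({u: round(p, 3) for u, p in pr.items()})
# → {'a': 0.388, 'b': 0.215, 'c': 0.397}
```

Node c ends up most important (both a and b link to it), and the scores sum to one, as expected for a probability distribution over nodes.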
Pagerank ultimately requires each node to iteratively receive information from its neighbors until the pagerank values stop changing and converge. The algorithm assigns a damping factor $d$, which can be understood by imagining a random surfer visiting nodes in the graph. With probability $d$, the surfer reaches a given node by moving along the edges of the graph; with probability $(1-d)$, it visits a node by randomly teleporting to it from elsewhere in the graph.\nIn APPNP, a multilayer perceptron takes the node features as input and outputs prediction probabilities $H^{0} = f_{MLP}(X)$. These are then propagated through the graph for $K$ iterations.\n$$H^{t+1} = (1-\\alpha)\\left(\\hat{D}^{-1/2} \\hat{A} \\hat{D}^{-1/2} H^{t}\\right) + \\alpha H^{0}$$If you think the APPNP and pagerank equations look similar, it is because they are. The teleportation probability $\\alpha$ corresponds to the damping factor in pagerank. It tells us that at each iteration, the predictions for a node depend on its neighbors\u0026rsquo; predictions, propagated through the symmetrically normalized adjacency matrix, with weight $(1-\\alpha)$, and on the output of the MLP with weight $\\alpha$.\nSo how do we train the APPNP model? I modified an implementation of an APPNP layer from DGL. The original paper by Klicpera et al. (2019) uses an architecture consisting of a 2-layer MLP with 64 hidden units each, followed by the propagation component given by the APPNP equation. To keep the architecture comparable to the GCN model, I decided to set the number of hidden units to 100 (same as in the previous section).\nI tested several combinations of values for $K$ and $\\alpha$. However, a more exhaustive hyperparameter search would be needed to find the best possible configuration. As before, I used two versions of the feature set: all features vs. only local features.\nThe best F1 score was obtained for $K = 20$ propagation iterations and a teleportation probability $\\alpha = 0.2$.
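The propagation rule in the APPNP equation can be sketched directly in numpy. This is a toy dense-matrix illustration on a made-up five-node chain graph, not the sparse message-passing implementation behind DGL's APPNPConv:

```python
import numpy as np

# H_{t+1} = (1 - alpha) * (D^{-1/2} (A + I) D^{-1/2}) @ H_t + alpha * H_0
# Five nodes in a chain, two classes; all values are made up.
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(5)                          # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D^{-1/2} as a vector
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

rng = np.random.default_rng(0)
H0 = rng.random((5, 2))   # stand-in for the MLP predictions f_MLP(X)
alpha, K = 0.2, 20        # teleportation probability and iterations

H = H0.copy()
for _ in range(K):
    H = (1 - alpha) * (A_norm @ H) + alpha * H0
```

With $\alpha = 1$ the propagation collapses to the plain MLP output; with a small $\alpha$ and large $K$, each node's prediction is smoothed over an increasingly long-ranging neighborhood, which is exactly what the chain-shaped transaction graph seems to need.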
Interestingly, the model trained with only local features performed better than using all features. Furthermore, the addition of dropout to the network had a positive effect when using all features, but a negative effect if only local features were considered.\nclass APPNP(nn.Module): def __init__( self, g, in_feats, n_hidden, n_classes, n_layers, activation, feat_drop, edge_drop, alpha, k, ): super(APPNP, self).__init__() self.g = g self.layers = nn.ModuleList() # input layer self.layers.append(nn.Linear(in_feats, n_hidden)) # hidden layers for _ in range(n_layers - 2): self.layers.append(nn.Linear(n_hidden, n_hidden)) # output layer self.layers.append(nn.Linear(n_hidden, n_classes)) self.activation = activation if feat_drop: self.feat_drop = nn.Dropout(feat_drop) else: self.feat_drop = lambda x: x self.propagate = APPNPConv(k, alpha, edge_drop) self.reset_parameters() def reset_parameters(self): for layer in self.layers: layer.reset_parameters() def forward(self, features): # prediction step h = features h = self.feat_drop(h) h = self.activation(self.layers[0](h)) for layer in self.layers[1:-1]: h = self.activation(layer(h)) h = self.layers[-1](self.feat_drop(h)) # propagation step h = self.propagate(self.g, h) return h query = \u0026#39;model==\u0026#34;appnp\u0026#34; and nhidden==100 and bidirectional==True and weight_decay==0.0005 and nobias==\u0026#34;False\u0026#34; and (dropout==0 or dropout==0.2 or dropout==0.25)\u0026#39; print(f\u0026#34;Plotting {runs.query(query)[\u0026#39;name\u0026#39;].nunique()} runs: {runs.query(query)[\u0026#39;name\u0026#39;].unique()}\u0026#34;) sns.relplot(\u0026#39;epoch\u0026#39;, \u0026#39;val_f1_score\u0026#39;, row=\u0026#39;onlylocal\u0026#39;, col=\u0026#39;alpha\u0026#39;, hue=\u0026#39;k\u0026#39;, style=\u0026#39;dropout\u0026#39;, palette=sns.color_palette(\u0026#34;Set1\u0026#34;, 2), kind=\u0026#39;line\u0026#39;, data=runs.query(query)); Plotting 12 runs: ['tough-sweep-4' 'fluent-sweep-3' 'happy-sweep-2' 
'eager-sweep-1' 'vivid-sweep-2' 'copper-sweep-1' 'classic-sweep-10' 'confused-sweep-9' 'generous-sweep-6' 'dulcet-sweep-5' 'avid-sweep-2' 'giddy-sweep-1'] MLP benchmark But just how much of an effect does the incorporation of the graph structure have on the performance of the model? How much of it is simply due to the MLP whose predictions feed the propagation phase of the APPNP? To assess this, I trained an MLP model with the same architecture on the two versions of the feature set.\nThe results show that the APPNP model has a better F1 score than the MLP after 1000 epochs, both for the only-local feature set and the full feature set. With respect to precision and F1 score, the addition of dropout was beneficial when using all features and detrimental when using the local features only. In contrast, recall was slightly improved by adding dropout for both feature variants.\nclass MLP(nn.Module): def __init__( self, in_feats, n_hidden, n_classes, n_layers, activation, feat_drop, ): super(MLP, self).__init__() self.layers = nn.ModuleList() # input layer self.layers.append(nn.Linear(in_feats, n_hidden)) # hidden layers for _ in range(n_layers - 2): self.layers.append(nn.Linear(n_hidden, n_hidden)) # output layer self.layers.append(nn.Linear(n_hidden, n_classes)) self.activation = activation if feat_drop: self.feat_drop = nn.Dropout(feat_drop) else: self.feat_drop = lambda x: x self.reset_parameters() def reset_parameters(self): for layer in self.layers: layer.reset_parameters() def forward(self, features): # prediction step h = features h = self.feat_drop(h) h = self.activation(self.layers[0](h)) for layer in self.layers[1:-1]: h = self.activation(layer(h)) h = self.layers[-1](self.feat_drop(h)) return h #hide_input query = \u0026#39;((model==\u0026#34;appnp\u0026#34; and k==\u0026#34;20.0\u0026#34; and alpha==\u0026#34;0.2\u0026#34;) or model==\u0026#34;mlp\u0026#34;) and bidirectional==True and weight_decay==0.0005 and
nobias==\u0026#34;False\u0026#34; and (dropout==0 or dropout==0.2)\u0026#39; print(f\u0026#34;Plotting {runs.query(query)[\u0026#39;name\u0026#39;].nunique()} runs: {runs.query(query)[\u0026#39;name\u0026#39;].unique()}\u0026#34;) sns.relplot(\u0026#39;epoch\u0026#39;, \u0026#39;val_f1_score\u0026#39;, col=\u0026#39;onlylocal\u0026#39;, hue=\u0026#39;model\u0026#39;, style=\u0026#39;dropout\u0026#39;, palette=sns.color_palette(\u0026#34;Set1\u0026#34;, 2), kind=\u0026#39;line\u0026#39;, data=runs.query(query)); Plotting 8 runs: ['tough-sweep-4' 'fluent-sweep-3' 'happy-sweep-2' 'crisp-sweep-4' 'lucky-sweep-3' 'summer-sweep-2' 'eager-sweep-1' 'sandy-sweep-1'] #hide_input query = \u0026#39;((model==\u0026#34;appnp\u0026#34; and k==\u0026#34;20.0\u0026#34; and alpha==\u0026#34;0.2\u0026#34;) or model==\u0026#34;mlp\u0026#34;) and bidirectional==True and weight_decay==0.0005 and nobias==\u0026#34;False\u0026#34; and (dropout==0 or dropout==0.2)\u0026#39; print(f\u0026#34;Plotting {runs.query(query)[\u0026#39;name\u0026#39;].nunique()} runs: {runs.query(query)[\u0026#39;name\u0026#39;].unique()}\u0026#34;) sns.relplot(\u0026#39;epoch\u0026#39;, \u0026#39;val_precision\u0026#39;, col=\u0026#39;onlylocal\u0026#39;, hue=\u0026#39;model\u0026#39;, style=\u0026#39;dropout\u0026#39;, palette=sns.color_palette(\u0026#34;Set1\u0026#34;, 2), kind=\u0026#39;line\u0026#39;, data=runs.query(query)); Plotting 8 runs: ['tough-sweep-4' 'fluent-sweep-3' 'happy-sweep-2' 'crisp-sweep-4' 'lucky-sweep-3' 'summer-sweep-2' 'eager-sweep-1' 'sandy-sweep-1'] #hide_input query = \u0026#39;((model==\u0026#34;appnp\u0026#34; and k==\u0026#34;20.0\u0026#34; and alpha==\u0026#34;0.2\u0026#34;) or model==\u0026#34;mlp\u0026#34;) and bidirectional==True and weight_decay==0.0005 and nobias==\u0026#34;False\u0026#34; and (dropout==0 or dropout==0.2)\u0026#39; print(f\u0026#34;Plotting {runs.query(query)[\u0026#39;name\u0026#39;].nunique()} runs: 
{runs.query(query)[\u0026#39;name\u0026#39;].unique()}\u0026#34;) sns.relplot(\u0026#39;epoch\u0026#39;, \u0026#39;val_recall\u0026#39;, col=\u0026#39;onlylocal\u0026#39;, hue=\u0026#39;model\u0026#39;, style=\u0026#39;dropout\u0026#39;, palette=sns.color_palette(\u0026#34;Set1\u0026#34;, 2), kind=\u0026#39;line\u0026#39;, data=runs.query(query)); Plotting 8 runs: ['tough-sweep-4' 'fluent-sweep-3' 'happy-sweep-2' 'crisp-sweep-4' 'lucky-sweep-3' 'summer-sweep-2' 'eager-sweep-1' 'sandy-sweep-1'] Putting it all together Please refer to my tl;dr :)\n#hide_input query = \u0026#39;\u0026#39;\u0026#39;((onlylocal==True and dropout==0) or (onlylocal==False and (dropout==0.2 or dropout==0.5))) and ((model==\u0026#34;appnp\u0026#34; and k==\u0026#34;20.0\u0026#34; and alpha==\u0026#34;0.2\u0026#34; and (dropout==0. or dropout==0.2)) or (model==\u0026#34;mlp\u0026#34; and (dropout==0. or dropout==0.2)) or (model==\u0026#34;gcn\u0026#34; and (dropout==0. or dropout==0.5))) and bidirectional==True and weight_decay==0.0005 and nobias==\u0026#34;False\u0026#34;\u0026#39;\u0026#39;\u0026#39;.replace(\u0026#39;\\n\u0026#39;,\u0026#39; \u0026#39;) g1 = sns.relplot(\u0026#39;epoch\u0026#39;, \u0026#39;val_f1_score\u0026#39;, col=\u0026#39;onlylocal\u0026#39;, hue=\u0026#39;model\u0026#39;, style=\u0026#39;dropout\u0026#39;, palette=sns.color_palette(\u0026#34;Set1\u0026#34;, 3), kind=\u0026#39;line\u0026#39;, data=runs.query(query)); g1.axes.flat[0].axhline(0.8072, c=\u0026#39;k\u0026#39;, alpha=0.8, ls=\u0026#39;-.\u0026#39;, lw=1) g1.axes.flat[0].text(1,0.815,\u0026#39;RandomForest\u0026#39;) g1.axes.flat[1].axhline(0.7799, c=\u0026#39;k\u0026#39;, alpha=0.8, ls=\u0026#39;-.\u0026#39;, lw=1) g1.axes.flat[1].text(1,0.785,\u0026#39;RandomForest\u0026#39;) plt.suptitle(\u0026#39;Performance of the best models of each class using all features vs. 
only local features\u0026#39;, y=1.02); ","permalink":"https://arcosdiaz.com/posts/2019-12-15-btc-fraud-detection/","summary":"\u003ch2 id=\"tldr\"\u003etl;dr\u003c/h2\u003e\n\u003cp\u003eI trained 4 different types of models to classify bitcoin transactions. For each, two versions of the feature set were used: \u003cem\u003eall features\u003c/em\u003e (local + neighborhood-aggregated) and \u003cem\u003eonly local features\u003c/em\u003e (without neighborhood information).\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eThe best model was a Random Forest trained with all features: its performance was impaired when the aggregated features were removed.\u003c/li\u003e\n\u003cli\u003eThe best graph-based neural network model was APPNP and its performance was better when only local features were used. APPNP performed better than an MLP with comparable complexity, indicating that the graph structure information gave it an advantage.\u003c/li\u003e\n\u003cli\u003eFinally, the best GCN model required using all features and several strategies to reduce overfitting.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe excellent performance of a Random Forest shows that it makes sense to consider simple models when faced with a new task. It also indicates that the individual node features in the Elliptic dataset are already informative enough to make good predictions. It would be interesting to explore how the model performs, when fewer samples and/or features are available for training.\u003c/p\u003e","title":"Graph Convolutional Networks for Fraud Detection of Bitcoin Transactions"},{"content":"The goal of this notebook is to provide an analysis of the time-series data from a user of a fitbit tracker throughout a year. 
I will use this data to predict an additional year of the user\u0026rsquo;s activity using Generalized Additive Models.\nData source: Activity, Sleep\nPackages used:\npandas, numpy, matplotlib, seaborn\nProphet\nimport pandas as pd import numpy as np from fbprophet import Prophet import matplotlib.pyplot as plt import seaborn as sns %matplotlib inline Data cleaning (missing data and outliers) # import the activity data activity = pd.read_csv(\u0026#39;OneYearFitBitData.csv\u0026#39;) # change commas to dots activity.iloc[:,1:] = activity.iloc[:,1:].applymap(lambda x: float(str(x).replace(\u0026#39;,\u0026#39;,\u0026#39;.\u0026#39;))) # change column names to English activity.columns = [\u0026#39;Date\u0026#39;, \u0026#39;BurnedCalories\u0026#39;, \u0026#39;Steps\u0026#39;, \u0026#39;Distance\u0026#39;, \u0026#39;Floors\u0026#39;, \u0026#39;SedentaryMinutes\u0026#39;, \u0026#39;LightMinutes\u0026#39;, \u0026#39;ModerateMinutes\u0026#39;, \u0026#39;IntenseMinutes\u0026#39;, \u0026#39;IntenseActivityCalories\u0026#39;] # import the sleep data sleep = pd.read_csv(\u0026#39;OneYearFitBitDataSleep.csv\u0026#39;) # check the size of the dataframes activity.shape, sleep.shape # merge dataframes data = pd.merge(activity, sleep, how=\u0026#39;outer\u0026#39;, on=\u0026#39;Date\u0026#39;) # parse date into correct format data[\u0026#39;Date\u0026#39;] = pd.to_datetime(data[\u0026#39;Date\u0026#39;], format=\u0026#39;%d-%m-%Y\u0026#39;) # correct units for Calories and Steps for c in [\u0026#39;BurnedCalories\u0026#39;, \u0026#39;Steps\u0026#39;, \u0026#39;IntenseActivityCalories\u0026#39;]: data[c] = data[c]*1000 Once imported, we should check for any missing data:\n# check for missing data data.isnull().sum() Date 0 BurnedCalories 0 Steps 0 Distance 0 Floors 0 SedentaryMinutes 0 LightMinutes 0 ModerateMinutes 0 IntenseMinutes 0 IntenseActivityCalories 0 MinutesOfSleep 5 MinutesOfBeingAwake 5 NumberOfAwakings 5 LengthOfRestInMinutes 5 dtype: int64 # check complete rows where sleep data
is missing data.iloc[np.where(data[\u0026#39;MinutesOfSleep\u0026#39;].isnull())[0],:] We can see that the sleep information was missing for some dates. The activity information for those days is complete. Therefore, we should not get rid of those rows just yet.\n# check rows for which the step count is zero data.iloc[np.where(data[\u0026#39;Steps\u0026#39;]==0)[0],:] We can also see that the step count for some datapoints is zero. If we look at the complete rows, we can see that on those days nearly no other data was recorded. I assume that the user probably did not wear the fitness tracker on those days, so we can get rid of those rows entirely.\n# drop days with a step count of zero data = data.drop(np.where(data[\u0026#39;Steps\u0026#39;]==0)[0], axis=0) # plot the distribution of data for step count sns.distplot(data[\u0026#39;Steps\u0026#39;]) plt.title(\u0026#39;Histogram for step count\u0026#39;) Step count is probably the most accurate measure obtained from a pedometer. Looking at the distribution of this variable, however, we can see that there is a chance that we have outliers in the data, as at least one value seems to be much higher than all the rest.\n# sort data by step count in descending order data.sort_values(by=\u0026#39;Steps\u0026#39;, ascending=False).head() We found the outlier! It seems that the step count for the first day (our data starts on May 8th, 2015) is too high to be a correct value for the number of steps taken by the user on that day. Perhaps the device accumulates vibrations from the time of production as steps, which are then uploaded on the first day the user wears the tracker.
In any case, we can get rid of that row, since the sleep data is also not available for this day.\n# drop outlier data = data.drop(np.where(data[\u0026#39;Steps\u0026#39;]\u0026gt;=100000)[0], axis=0) Now we can look at our preprocessed data. Shape, distribution of the variables, and a look at some rows from the dataframe are all useful things to observe:\ndata.shape (369, 14) fig, ax = plt.subplots(5,3, figsize=(8,10)) for c, a in zip(data.columns[1:], ax.flat): df = pd.DataFrame() df[\u0026#39;ds\u0026#39;] = data[\u0026#39;Date\u0026#39;] df[\u0026#39;y\u0026#39;] = data[c] df = df.dropna(axis=0, how=\u0026#39;any\u0026#39;) sns.distplot(df[\u0026#39;y\u0026#39;], axlabel=False, ax=a) a.set_title(c) plt.suptitle(\u0026#39;Histograms of variables from fitbit data\u0026#39;, y=1.02, fontsize=14); plt.tight_layout() data.head() Predicting the step count for an additional year In order to use the Prophet package to predict the future using a Generalized Additive Model, we need to create a dataframe with columns ds and y (we need to do this for each variable):\nds is the date stamp giving the time component\ny is the variable that we want to predict\nIn our case we will use the log transform of the step count in order to decrease the effect of outliers on the model.\ndf = pd.DataFrame() df[\u0026#39;ds\u0026#39;] = data[\u0026#39;Date\u0026#39;] df[\u0026#39;y\u0026#39;] = data[\u0026#39;Steps\u0026#39;] # log-transform of step count df[\u0026#39;y\u0026#39;] = np.log(df[\u0026#39;y\u0026#39;]) Now we need to specify the type of growth model that we want to use:\nLinear: assumes that the variable y grows linearly in time (doesn\u0026rsquo;t apply to our step count scenario, if the person sticks to their normal lifestyle)\nLogistic: assumes that the variable y grows logistically in time and saturates at some point\nI will assume that the person, for whom we want to predict the step
count in the following year, will not have any dramatic lifestyle changes that make them start to walk more. Therefore, I am using logistic \u0026lsquo;growth\u0026rsquo; with the cap set to the median of the data, which in practice means that the step count\u0026rsquo;s growth trend will be \u0026lsquo;zero growth\u0026rsquo;.\ndf[\u0026#39;cap\u0026#39;] = df[\u0026#39;y\u0026#39;].median() m = Prophet(growth=\u0026#39;logistic\u0026#39;, yearly_seasonality=True) m.fit(df) INFO:fbprophet.forecaster:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this. After fitting the model, we need a new dataframe future with the additional rows for which we want to predict y.\nfuture = m.make_future_dataframe(periods=365, freq=\u0026#39;D\u0026#39;) future[\u0026#39;cap\u0026#39;] = df[\u0026#39;y\u0026#39;].median() Now we can call predict on the fitted model and obtain relevant statistics for the forecast period. We can also plot the results.\nforecast = m.predict(future) forecast[[\u0026#39;ds\u0026#39;, \u0026#39;yhat\u0026#39;, \u0026#39;yhat_lower\u0026#39;, \u0026#39;yhat_upper\u0026#39;]].tail() m.plot(forecast, ylabel=\u0026#39;log(Steps)\u0026#39;, xlabel=\u0026#39;Date\u0026#39;); plt.title(\u0026#39;1-year prediction of step count from 1 year of fitbit data\u0026#39;); We can see that the model did a good job of mimicking the behavior of the step count during the year for which the data was available.
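To make the \u0026lsquo;zero growth\u0026rsquo; idea concrete, here is a toy logistic trend curve saturating at its cap. The parameters are made up for illustration and are not what Prophet fits internally:

```python
import numpy as np

# Schematic logistic trend g(t) = C / (1 + exp(-k * (t - m))).
# C plays the role of the cap (here, roughly a median of log(Steps));
# k and m are arbitrary illustration values.
C, k, m = 9.2, 0.05, 0.0
t = np.arange(365, dtype=float)
g = C / (1 + np.exp(-k * (t - m)))

print(round(g[0], 2), round(g[-1], 2))  # starts at C/2, saturates at the cap C
# → 4.6 9.2
```

Because the observed series already sits near the cap, setting the cap to the median pushes the fitted trend into this saturated, essentially flat regime.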
This seems reasonable, as we do not necessarily expect the pattern to vary if the person continues to have a similar lifestyle.\nAdditionally, we can plot the components from the Generalized Additive Model and see their effect on the \u0026lsquo;y\u0026rsquo; variable. In this case we have the general trend (remember we capped this at the median), the yearly seasonality effect, and the weekly effect.\nm.plot_components(forecast); plt.suptitle(\u0026#39;GAM components for prediction of step count\u0026#39;, y=1.02, fontsize=14); Here we see some interesting patterns:\nThe general \u0026lsquo;growth\u0026rsquo; trend is as expected, as we assumed that there would be no growth beyond the median of the existing data.\nThe yearly effect shows a trend towards higher activity during the summer months; however, the variation is considerable, probably because our dataset consists of the data for one year only\nThe weekly effect shows that Sunday is a day of lower activity for this person, whereas Saturday is the day where the activity is the highest. So, grocery shopping on Saturday, Netflix on Sunday? :) Sleep analysis A very important part of our lives is sleep.
It would be very interesting to look at the sleep habits of the user of the fitness tracker and see if we can get some insights from this data.\ndf = pd.DataFrame() df[\u0026#39;ds\u0026#39;] = data[\u0026#39;Date\u0026#39;] df[\u0026#39;y\u0026#39;] = data[\u0026#39;MinutesOfSleep\u0026#39;] df = df.dropna(axis=0, how=\u0026#39;any\u0026#39;) # drop rows where sleep time is zero, as this would mean that the person did not wear the tracker overnight and the data is missing df = df.iloc[np.where(df[\u0026#39;y\u0026#39;]!=0)[0],:] # distribution of MinutesOfSleep sns.distplot(df[\u0026#39;y\u0026#39;]) df[\u0026#39;cap\u0026#39;] = df[\u0026#39;y\u0026#39;].median() m = Prophet(growth=\u0026#39;logistic\u0026#39;, yearly_seasonality=True) m.fit(df) future = m.make_future_dataframe(periods=365, freq=\u0026#39;D\u0026#39;) future[\u0026#39;cap\u0026#39;] = df[\u0026#39;y\u0026#39;].median() forecast = m.predict(future) m.plot(forecast); plt.title(\u0026#39;1-year prediction of MinutesOfSleep from 1 year of fitbit data\u0026#39;); The model again predicts similar sleep behavior for the forecast year.
This seems reasonable, as we do not necessarily expect the pattern to vary if the person continues to have a similar lifestyle.\nm.plot_components(forecast); plt.suptitle(\u0026#39;GAM components for prediction of MinutesOfSleep\u0026#39;, y=1.02, fontsize=14); A look at the amount of sleep reveals:\nA saturation trend at the median (we set this assumption)\nA yearly effect showing a trend towards a higher amount of sleep during the summer months, with more variation during winter\nThe weekly effect shows the lowest sleep amount on Mondays (maybe going to bed late on Sunday and waking up early on Monday is a pattern for this user). The highest amount of sleep occurs on Saturdays (no alarm to wake up to on Saturday morning!). Interestingly, the user seems to get more sleep on Wednesdays than on Mondays or Tuesdays, which could mean that their work schedule is not constant during weekdays. Appendix As an exercise, I have plotted the predictions for the most interesting variables in the dataset. Enjoy!\nzeros_allowed = [\u0026#39;Floors\u0026#39;, \u0026#39;SedentaryMinutes\u0026#39;, \u0026#39;LightMinutes\u0026#39;, \u0026#39;ModerateMinutes\u0026#39;, \u0026#39;IntenseMinutes\u0026#39;, \u0026#39;IntenseActivityCalories\u0026#39;, \u0026#39;MinutesOfBeingAwake\u0026#39;, \u0026#39;NumberOfAwakings\u0026#39;] fig, ax = plt.subplots(3,3, figsize=(12,6), sharex=True) predict_cols = [\u0026#39;Steps\u0026#39;, \u0026#39;Floors\u0026#39;, \u0026#39;BurnedCalories\u0026#39;, \u0026#39;LightMinutes\u0026#39;, \u0026#39;ModerateMinutes\u0026#39;, \u0026#39;IntenseMinutes\u0026#39;, \u0026#39;MinutesOfSleep\u0026#39;, \u0026#39;MinutesOfBeingAwake\u0026#39;, \u0026#39;NumberOfAwakings\u0026#39;] for c, a in zip(predict_cols, ax.flat): df = pd.DataFrame() df[\u0026#39;ds\u0026#39;] = data[\u0026#39;Date\u0026#39;] df[\u0026#39;y\u0026#39;] = data[c] df = df.dropna(axis=0, how=\u0026#39;any\u0026#39;) if c not in zeros_allowed: df = df.iloc[np.where(df[\u0026#39;y\u0026#39;]!=0)[0],:]
df[\u0026#39;cap\u0026#39;] = df[\u0026#39;y\u0026#39;].median() m = Prophet(growth=\u0026#39;logistic\u0026#39;, yearly_seasonality=True) m.fit(df) future = m.make_future_dataframe(periods=365, freq=\u0026#39;D\u0026#39;) future[\u0026#39;cap\u0026#39;] = df[\u0026#39;y\u0026#39;].median() future.tail() forecast = m.predict(future) forecast[[\u0026#39;ds\u0026#39;, \u0026#39;yhat\u0026#39;, \u0026#39;yhat_lower\u0026#39;, \u0026#39;yhat_upper\u0026#39;]].tail() m.plot(forecast, xlabel=\u0026#39;\u0026#39;, ax=a); a.set_title(c) #m.plot_components(forecast); plt.suptitle(\u0026#39;1-year prediction per variable from 1 year of fitbit data\u0026#39;, y=1.02, fontsize=14); plt.tight_layout() INFO:fbprophet.forecaster:Disabling daily seasonality.
Run prophet with daily_seasonality=True to override this. ","permalink":"https://arcosdiaz.com/archive/2018-04-01-fitbit_prophet/","summary":"\u003cp\u003eThe goal of this notebook is to provide an analysis of the time-series data from a user of a fitbit tracker throughout a year. I will use this data to predict an additional year of the life of the user using \u003ca href=\"https://en.wikipedia.org/wiki/Generalized_additive_model\"\u003eGeneralized Additive Models\u003c/a\u003e.\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://algo-data.quora.com/Data-sets-of-any-type-some-links\"\u003eData source\u003c/a\u003e: \u003ca href=\"https://drive.google.com/open?id=0Bx4yoK5aogTSbGJ2WlkwYjlHejQ\"\u003eActivity\u003c/a\u003e, \u003ca href=\"https://drive.google.com/open?id=0Bx4yoK5aogTSMUFqRjVNcko5WlU\"\u003eSleep\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003ePackages used:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003epandas, numpy, matplotlib, seaborn\u003c/li\u003e\n\u003cli\u003e\u003ca href=\"https://github.com/facebook/prophet\"\u003eProphet\u003c/a\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;\"\u003e\u003ccode class=\"language-python\" data-lang=\"python\"\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003eimport\u003c/span\u003e pandas \u003cspan style=\"color:#66d9ef\"\u003eas\u003c/span\u003e pd\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003eimport\u003c/span\u003e numpy \u003cspan style=\"color:#66d9ef\"\u003eas\u003c/span\u003e np\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003efrom\u003c/span\u003e fbprophet \u003cspan style=\"color:#f92672\"\u003eimport\u003c/span\u003e 
Prophet\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003eimport\u003c/span\u003e matplotlib.pyplot \u003cspan style=\"color:#66d9ef\"\u003eas\u003c/span\u003e plt\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003eimport\u003c/span\u003e seaborn \u003cspan style=\"color:#66d9ef\"\u003eas\u003c/span\u003e sns\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003e%\u003c/span\u003ematplotlib inline\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003ch2 id=\"data-cleaning-missing-data-and-outliers\"\u003eData cleaning (missing data and outliers)\u003c/h2\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;\"\u003e\u003ccode class=\"language-python\" data-lang=\"python\"\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#75715e\"\u003e# import the activity data\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003eactivity \u003cspan style=\"color:#f92672\"\u003e=\u003c/span\u003e pd\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003eread_csv(\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;OneYearFitBitData.csv\u0026#39;\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#75715e\"\u003e# change commas to dots\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003eactivity\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003eiloc[:,\u003cspan 
style=\"color:#ae81ff\"\u003e1\u003c/span\u003e:] \u003cspan style=\"color:#f92672\"\u003e=\u003c/span\u003e activity\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003eiloc[:,\u003cspan style=\"color:#ae81ff\"\u003e1\u003c/span\u003e:]\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003eapplymap(\u003cspan style=\"color:#66d9ef\"\u003elambda\u003c/span\u003e x: float(str(x)\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003ereplace(\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;,\u0026#39;\u003c/span\u003e,\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;.\u0026#39;\u003c/span\u003e)))\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#75715e\"\u003e# change column names to English\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003eactivity\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003ecolumns \u003cspan style=\"color:#f92672\"\u003e=\u003c/span\u003e [\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;Date\u0026#39;\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#39;BurnedCalories\u0026#39;\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#39;Steps\u0026#39;\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#39;Distance\u0026#39;\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#39;Floors\u0026#39;\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#39;SedentaryMinutes\u0026#39;\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#39;LightMinutes\u0026#39;\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#39;ModerateMinutes\u0026#39;\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#39;IntenseMinutes\u0026#39;\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#39;IntenseActivityCalories\u0026#39;\u003c/span\u003e]\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#75715e\"\u003e# import the sleep data\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003esleep \u003cspan style=\"color:#f92672\"\u003e=\u003c/span\u003e pd\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003eread_csv(\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;OneYearFitBitDataSleep.csv\u0026#39;\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#75715e\"\u003e# check the size of the dataframes\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003eactivity\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003eshape, sleep\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003eshape\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#75715e\"\u003e# merge dataframes\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003edata \u003cspan style=\"color:#f92672\"\u003e=\u003c/span\u003e pd\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003emerge(activity, sleep, how\u003cspan style=\"color:#f92672\"\u003e=\u003c/span\u003e\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;outer\u0026#39;\u003c/span\u003e, on\u003cspan style=\"color:#f92672\"\u003e=\u003c/span\u003e\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;Date\u0026#39;\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#75715e\"\u003e# parse date into correct format\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003edata[\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;Date\u0026#39;\u003c/span\u003e] \u003cspan style=\"color:#f92672\"\u003e=\u003c/span\u003e pd\u003cspan 
style=\"color:#f92672\"\u003e.\u003c/span\u003eto_datetime(data[\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;Date\u0026#39;\u003c/span\u003e], format\u003cspan style=\"color:#f92672\"\u003e=\u003c/span\u003e\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;\u003c/span\u003e\u003cspan style=\"color:#e6db74\"\u003e%d\u003c/span\u003e\u003cspan style=\"color:#e6db74\"\u003e-%m-%Y\u0026#39;\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#75715e\"\u003e# correct units for Calories and Steps\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#66d9ef\"\u003efor\u003c/span\u003e c \u003cspan style=\"color:#f92672\"\u003ein\u003c/span\u003e [\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;BurnedCalories\u0026#39;\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#39;Steps\u0026#39;\u003c/span\u003e, \u003cspan style=\"color:#e6db74\"\u003e\u0026#39;IntenseActivityCalories\u0026#39;\u003c/span\u003e]:\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e    data[c] \u003cspan style=\"color:#f92672\"\u003e=\u003c/span\u003e data[c]\u003cspan style=\"color:#f92672\"\u003e*\u003c/span\u003e\u003cspan style=\"color:#ae81ff\"\u003e1000\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003cp\u003eOnce imported, we should check for any missing data:\u003c/p\u003e","title":"Fitbit activity and sleep data: a time-series analysis with Generalized Additive Models"},{"content":"This notebook describes my approach to the Kaggle competition named in the title. 
This was a research competition at Kaggle in cooperation with the Memorial Sloan Kettering Cancer Center (MSKCC).\nThe goal of the competition was to create a machine learning algorithm that can classify genetic variations that are present in cancer cells.\nTumors contain cells with many different abnormal mutations in their DNA: some of these mutations are the drivers of tumor growth, whereas others are neutral and considered passengers. Normally, mutations are manually classified into different categories after literature review by clinicians. The dataset made available for this competition contains mutations that have been manually annotated into 9 different categories. The goal is to predict the correct category of the mutations in the test set.\nThe model and submission described here placed 140th out of 1386 teams, i.e. in the top 11%.\nData The data comes in two kinds of files: one contains information about the genetic variants (training_variants and stage2_test_variants.csv) and the other contains the text (clinical evidence) that was used to manually classify the variants (training_text and stage2_test_text.csv). The training data contains a class target feature corresponding to one of the 9 categories into which variants can be classified.\nNote: the \u0026ldquo;stage2\u0026rdquo; prefix of the test files is due to the nature of the competition. 
There was an initial test set that was used at the beginning of the competition and a \u0026ldquo;stage2\u0026rdquo; test set that was used in the final week before the deadline to make the submissions.\nimport os import re import string import pandas as pd import numpy as np train_variant = pd.read_csv(\u0026#34;input/training_variants\u0026#34;) test_variant = pd.read_csv(\u0026#34;input/stage2_test_variants.csv\u0026#34;) train_text = pd.read_csv(\u0026#34;input/training_text\u0026#34;, sep=\u0026#34;\\|\\|\u0026#34;, engine=\u0026#39;python\u0026#39;, header=None, skiprows=1, names=[\u0026#34;ID\u0026#34;,\u0026#34;Text\u0026#34;]) test_text = pd.read_csv(\u0026#34;input/stage2_test_text.csv\u0026#34;, header=None, skiprows=1, names=[\u0026#34;ID\u0026#34;, \u0026#34;Text\u0026#34;]) train = pd.merge(train_variant, train_text, how=\u0026#39;left\u0026#39;, on=\u0026#39;ID\u0026#39;) train_y = train[\u0026#39;Class\u0026#39;].values train_x = train.drop(\u0026#39;Class\u0026#39;, axis=1) train_size=len(train_x) print(\u0026#39;Number of training variants: %d\u0026#39; % (train_size)) # number of train data : 3321 test_x = pd.merge(test_variant, test_text, how=\u0026#39;left\u0026#39;, on=\u0026#39;ID\u0026#39;) test_size=len(test_x) print(\u0026#39;Number of test variants: %d\u0026#39; % (test_size)) # number of test data : 5668 test_index = test_x[\u0026#39;ID\u0026#39;].values all_data = np.concatenate((train_x, test_x), axis=0) all_data = pd.DataFrame(all_data) all_data.columns = [\u0026#34;ID\u0026#34;, \u0026#34;Gene\u0026#34;, \u0026#34;Variation\u0026#34;, \u0026#34;Text\u0026#34;] Number of training variants: 3321 Number of test variants: 986 all_data.head() .dataframe thead th { text-align: left; } .dataframe tbody tr th { vertical-align: top; } The data from the different train and test files is now consolidated into one single file. This is necessary for the correct vectorization of the text data and categorical data later on. 
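The loading-and-merge step above can be checked on a miniature example. The two-row stand-in files below are hypothetical (not taken from the competition data); they only mirror the file formats used, in particular the ||-delimited text file:

```python
import io

import pandas as pd

# Hypothetical two-row stand-ins mirroring training_variants / training_text
variants_csv = "ID,Gene,Variation,Class\n0,CBL,W802*,2\n1,CBL,Q249E,2\n"
text_raw = "ID,Text\n0||cyclin dependent kinases regulate the cell cycle\n1||abstract background lung cancer\n"

variants = pd.read_csv(io.StringIO(variants_csv))
# sep="\|\|" is a regex matching a literal '||'; it requires the python engine
texts = pd.read_csv(io.StringIO(text_raw), sep=r"\|\|", engine="python",
                    header=None, skiprows=1, names=["ID", "Text"])

train = pd.merge(variants, texts, how="left", on="ID")
print(train.shape)    # (2, 5)
```

The same left merge on ID is what attaches the clinical evidence text to each variant in the real data.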
We can see that the text information resembles scientific article text. We will process this consolidated file in the next step.\nPreprocessing In order to be able to use this data to train a machine learning model, we need to extract the features from the dataset. This means that we have to transform the text data into vectors that can be understood by an algorithm. As I am not an expert in Natural Language Processing, I applied a modified version of this script published on Kaggle. Afterwards we will have the data in a form that I can use to train a neural network.\n# Pre-processing script by Aly Osama https://www.kaggle.com/alyosama/doc2vec-with-keras-0-77 from nltk.corpus import stopwords from gensim.models.doc2vec import LabeledSentence from gensim import utils def constructLabeledSentences(data): sentences=[] for index, row in data.iteritems(): sentences.append(LabeledSentence(utils.to_unicode(row).split(), [\u0026#39;Text\u0026#39; + \u0026#39;_%s\u0026#39; % str(index)])) return sentences def textClean(text): text = re.sub(r\u0026#34;[^A-Za-z0-9^,!.\\/\u0026#39;+-=]\u0026#34;, \u0026#34; \u0026#34;, str(text)) text = text.lower().split() stops = set(stopwords.words(\u0026#34;english\u0026#34;)) text = [w for w in text if not w in stops] text = \u0026#34; \u0026#34;.join(text) return(text) def cleanup(text): text = textClean(text) text= text.translate(str.maketrans(\u0026#34;\u0026#34;,\u0026#34;\u0026#34;, string.punctuation)) return text allText = all_data[\u0026#39;Text\u0026#39;].apply(cleanup) sentences = constructLabeledSentences(allText) allText.head() Using TensorFlow backend. 0 cyclindependent kinases cdks regulate variety ... 1 abstract background nonsmall cell lung cancer ... 2 abstract background nonsmall cell lung cancer ... 3 recent evidence demonstrated acquired uniparen... 4 oncogenic mutations monomeric casitas blineage... 
Name: Text, dtype: object # Pre-processing script by Aly Osama https://www.kaggle.com/alyosama/doc2vec-with-keras-0-77 # PROCESS TEXT DATA from gensim.models import Doc2Vec Text_INPUT_DIM=300 text_model=None filename=\u0026#39;docEmbeddings_5_clean.d2v\u0026#39; if os.path.isfile(filename): text_model = Doc2Vec.load(filename) else: text_model = Doc2Vec(min_count=1, window=5, size=Text_INPUT_DIM, sample=1e-4, negative=5, workers=4, iter=5,seed=1) text_model.build_vocab(sentences) text_model.train(sentences, total_examples=text_model.corpus_count, epochs=text_model.iter) text_model.save(filename) text_train_arrays = np.zeros((train_size, Text_INPUT_DIM)) text_test_arrays = np.zeros((test_size, Text_INPUT_DIM)) for i in range(train_size): text_train_arrays[i] = text_model.docvecs[\u0026#39;Text_\u0026#39;+str(i)] j=0 for i in range(train_size,train_size+test_size): text_test_arrays[j] = text_model.docvecs[\u0026#39;Text_\u0026#39;+str(i)] j=j+1 print(text_train_arrays[0][:10]) # PROCESS GENE DATA from sklearn.decomposition import TruncatedSVD Gene_INPUT_DIM=25 svd = TruncatedSVD(n_components=25, n_iter=Gene_INPUT_DIM, random_state=12) one_hot_gene = pd.get_dummies(all_data[\u0026#39;Gene\u0026#39;]) truncated_one_hot_gene = svd.fit_transform(one_hot_gene.values) one_hot_variation = pd.get_dummies(all_data[\u0026#39;Variation\u0026#39;]) truncated_one_hot_variation = svd.fit_transform(one_hot_variation.values) # ENCODE THE LABELS FROM INTEGERS TO VECTORS from keras.utils import np_utils from sklearn.preprocessing import LabelEncoder label_encoder = LabelEncoder() label_encoder.fit(train_y) encoded_y = np_utils.to_categorical((label_encoder.transform(train_y))) print(encoded_y[0]) We have processed the train labels, as printed above (encoded_y), into vectors that contain 1 in the index of the category that the sample belongs to, and zeros in all other indexes.\nMoreover, the training and test sets are now stacked together to look like 
this:\ntrain_set=np.hstack((truncated_one_hot_gene[:train_size],truncated_one_hot_variation[:train_size],text_train_arrays)) test_set=np.hstack((truncated_one_hot_gene[train_size:],truncated_one_hot_variation[train_size:],text_test_arrays)) print(\u0026#39;Training set shape is: \u0026#39;, train_set.shape) # (3321, 350) print(\u0026#39;Test set shape is: \u0026#39;, test_set.shape) # (986, 350) print(\u0026#39;Training set example rows:\u0026#39;) print(train_set[0][:10]) # [ -2.46065582e-23 -5.21548048e-19 -1.95048372e-20 -2.44542833e-22 # -1.19176742e-22 1.61985461e-25 2.93618862e-25 -6.23860891e-27 # 1.14583929e-28 -1.79996588e-29] print(\u0026#39;Test set example rows:\u0026#39;) print(test_set[0][:10]) # [ 9.74220189e-33 -1.31484613e-27 4.37925347e-27 -9.88109317e-29 # 7.66365772e-27 6.58254980e-26 -3.74901712e-26 -8.97613299e-26 # -3.75471102e-23 -1.05563623e-21] Our data is now ready to be fed into a machine learning model, in this case, into a neural network in TensorFlow.\nTraining a 4-layer neural network for classification The next step is to create a neural network in TensorFlow. I am using a fully-connected neural network with 4 layers. For details on how the network is built, you can check my TensorFlow MNIST notebook. Wherever necessary, I will explain which adaptations were specifically needed for this challenge.\nimport math import time import matplotlib.pyplot as plt import numpy as np import pandas as pd import tensorflow as tf from sklearn.model_selection import train_test_split from tensorflow.python.framework import ops %matplotlib inline np.random.seed(1) I found it useful to add the current timestamp to the name of the files that the code will output. 
This helped me to uniquely identify the results from each run.\ntimestr = time.strftime(\u0026#34;%Y%m%d-%H%M%S\u0026#34;) dirname = \u0026#39;output/\u0026#39; # output directory filename = \u0026#39;\u0026#39; I select 20% of the training data to use as a validation set and be able to quantify my variance (watch out for overfitting), as I don\u0026rsquo;t want to have an algorithm that only works well with this specific training data set that was provided, but one that generalizes as well as possible.\n# split data into training and validation sets X_train, X_val, Y_train, Y_val = train_test_split(train_set, encoded_y, test_size=0.20, random_state=42) X_train, X_val, Y_train, Y_val = X_train.T, X_val.T, Y_train.T, Y_val.T # transpose test set X_test = test_set.T # view data set shapes print(\u0026#39;X_train: \u0026#39;, X_train.shape) print(\u0026#39;X_val: \u0026#39;, X_val.shape) print(\u0026#39;Y_train: \u0026#39;, Y_train.shape) print(\u0026#39;Y_val: \u0026#39;, Y_val.shape) print(\u0026#39;X_test: \u0026#39;, X_test.shape) X_train: (350, 2656) X_val: (350, 665) Y_train: (9, 2656) Y_val: (9, 665) X_test: (350, 986) Now I define the functions needed to build the neural network.\ndef create_placeholders(n_x, n_y): \u0026#34;\u0026#34;\u0026#34; Creates the placeholders for the tensorflow session. Arguments: n_x -- scalar, dimensions of the input n_y -- scalar, number of classes (from 0 to 8, so -\u0026gt; 9) Returns: X -- placeholder for the data input, of shape [n_x, None] and dtype \u0026#34;float\u0026#34; Y -- placeholder for the input labels, of shape [n_y, None] and dtype \u0026#34;float\u0026#34; \u0026#34;\u0026#34;\u0026#34; X = tf.placeholder(tf.float32, shape=(n_x, None), name=\u0026#39;X\u0026#39;) Y = tf.placeholder(tf.float32, shape=(n_y, None), name=\u0026#39;Y\u0026#39;) return X, Y def initialize_parameters(): \u0026#34;\u0026#34;\u0026#34; Initializes parameters to build a neural network with tensorflow. 
Returns: parameters -- a dictionary of tensors containing W and b for every layer \u0026#34;\u0026#34;\u0026#34; tf.set_random_seed(1) W1 = tf.get_variable(\u0026#39;W1\u0026#39;, [350, X_train.shape[0]], initializer=tf.contrib.layers.xavier_initializer(seed=1)) b1 = tf.get_variable(\u0026#39;b1\u0026#39;, [350, 1], initializer=tf.zeros_initializer()) W2 = tf.get_variable(\u0026#39;W2\u0026#39;, [350, 350], initializer=tf.contrib.layers.xavier_initializer(seed=1)) b2 = tf.get_variable(\u0026#39;b2\u0026#39;, [350, 1], initializer=tf.zeros_initializer()) W3 = tf.get_variable(\u0026#39;W3\u0026#39;, [100, 350], initializer=tf.contrib.layers.xavier_initializer(seed=1)) b3 = tf.get_variable(\u0026#39;b3\u0026#39;, [100, 1], initializer=tf.zeros_initializer()) W4 = tf.get_variable(\u0026#39;W4\u0026#39;, [9, 100], initializer=tf.contrib.layers.xavier_initializer(seed=1)) b4 = tf.get_variable(\u0026#39;b4\u0026#39;, [9, 1], initializer=tf.zeros_initializer()) parameters = {\u0026#34;W1\u0026#34;: W1, \u0026#34;b1\u0026#34;: b1, \u0026#34;W2\u0026#34;: W2, \u0026#34;b2\u0026#34;: b2, \u0026#34;W3\u0026#34;: W3, \u0026#34;b3\u0026#34;: b3, \u0026#34;W4\u0026#34;: W4, \u0026#34;b4\u0026#34;: b4} return parameters def forward_propagation(X, parameters, keep_prob1, keep_prob2): \u0026#34;\u0026#34;\u0026#34; Implements the forward propagation for the model: (LINEAR -\u0026gt; RELU)^3 -\u0026gt; LINEAR -\u0026gt; SOFTMAX Arguments: X -- input dataset placeholder, of shape (input size, number of examples) parameters -- python dictionary containing your parameters \u0026#34;W\u0026#34; and \u0026#34;b\u0026#34; for every layer the shapes are given in initialize_parameters Returns: Z4 -- the output of the last LINEAR unit (logits) \u0026#34;\u0026#34;\u0026#34; # Retrieve the parameters from the dictionary \u0026#34;parameters\u0026#34; W1 = parameters[\u0026#39;W1\u0026#39;] b1 = parameters[\u0026#39;b1\u0026#39;] W2 = parameters[\u0026#39;W2\u0026#39;] b2 = 
parameters[\u0026#39;b2\u0026#39;] W3 = parameters[\u0026#39;W3\u0026#39;] b3 = parameters[\u0026#39;b3\u0026#39;] W4 = parameters[\u0026#39;W4\u0026#39;] b4 = parameters[\u0026#39;b4\u0026#39;] Z1 = tf.matmul(W1, X) + b1 # Z1 = np.dot(W1, X) + b1 A1 = tf.nn.relu(Z1) # A1 = relu(Z1) A1 = tf.nn.dropout(A1, keep_prob1) # add dropout Z2 = tf.matmul(W2, A1) + b2 # Z2 = np.dot(W2, a1) + b2 A2 = tf.nn.relu(Z2) # A2 = relu(Z2) A2 = tf.nn.dropout(A2, keep_prob2) # add dropout Z3 = tf.matmul(W3, A2) + b3 # Z3 = np.dot(W3,Z2) + b3 A3 = tf.nn.relu(Z3) Z4 = tf.matmul(W4, A3) + b4 return Z4 def compute_cost(Z4, Y): \u0026#34;\u0026#34;\u0026#34; Computes the cost Arguments: Z4 -- output of forward propagation (output of the last LINEAR unit), of shape (n_classes, number of examples) Y -- \u0026#34;true\u0026#34; labels vector placeholder, same shape as Z4 Returns: cost - Tensor of the cost function \u0026#34;\u0026#34;\u0026#34; # transpose to fit the tensorflow requirement for tf.nn.softmax_cross_entropy_with_logits(...,...) logits = tf.transpose(Z4) labels = tf.transpose(Y) cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)) return cost def random_mini_batches(X, Y, mini_batch_size, seed=0): \u0026#34;\u0026#34;\u0026#34; Creates a list of random minibatches from (X, Y) Arguments: X -- input data, of shape (input size, number of examples) Y -- true \u0026#34;label\u0026#34; vector, of shape (1, number of examples) mini_batch_size - size of the mini-batches, integer seed Returns: mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y) \u0026#34;\u0026#34;\u0026#34; m = X.shape[1] # number of training examples mini_batches = [] np.random.seed(seed) # Step 1: Shuffle (X, Y) permutation = list(np.random.permutation(m)) shuffled_X = X[:, permutation] shuffled_Y = Y[:, permutation].reshape((Y.shape[0], m)) # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case. 
num_complete_minibatches = math.floor( m / mini_batch_size) # number of mini batches of size mini_batch_size in your partitioning for k in range(0, num_complete_minibatches): mini_batch_X = shuffled_X[:, k * mini_batch_size: k * mini_batch_size + mini_batch_size] mini_batch_Y = shuffled_Y[:, k * mini_batch_size: k * mini_batch_size + mini_batch_size] mini_batch = (mini_batch_X, mini_batch_Y) mini_batches.append(mini_batch) # Handling the end case (last mini-batch \u0026lt; mini_batch_size) if m % mini_batch_size != 0: mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size: m] mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size: m] mini_batch = (mini_batch_X, mini_batch_Y) mini_batches.append(mini_batch) return mini_batches def predict(X, parameters): W1 = tf.convert_to_tensor(parameters[\u0026#39;W1\u0026#39;]) b1 = tf.convert_to_tensor(parameters[\u0026#34;b1\u0026#34;]) W2 = tf.convert_to_tensor(parameters[\u0026#34;W2\u0026#34;]) b2 = tf.convert_to_tensor(parameters[\u0026#34;b2\u0026#34;]) W3 = tf.convert_to_tensor(parameters[\u0026#34;W3\u0026#34;]) b3 = tf.convert_to_tensor(parameters[\u0026#34;b3\u0026#34;]) W4 = tf.convert_to_tensor(parameters[\u0026#34;W4\u0026#34;]) b4 = tf.convert_to_tensor(parameters[\u0026#34;b4\u0026#34;]) params = {\u0026#34;W1\u0026#34;: W1, \u0026#34;b1\u0026#34;: b1, \u0026#34;W2\u0026#34;: W2, \u0026#34;b2\u0026#34;: b2, \u0026#34;W3\u0026#34;: W3, \u0026#34;b3\u0026#34;: b3, \u0026#34;W4\u0026#34;: W4, \u0026#34;b4\u0026#34;: b4} x = tf.placeholder(\u0026#34;float\u0026#34;, [X_train.shape[0], None]) keep_prob1 = tf.placeholder(tf.float32, name=\u0026#39;keep_prob1\u0026#39;) keep_prob2 = tf.placeholder(tf.float32, name=\u0026#39;keep_prob2\u0026#39;) z4 = forward_propagation(x, params, keep_prob1, keep_prob2) p = tf.nn.softmax(z4, dim=0) # dim=0 because the classes are on that axis # p = tf.argmax(z4) # this gives only the predicted class as output sess = tf.Session() prediction = 
sess.run(p, feed_dict={x: X, keep_prob1: 1.0, keep_prob2: 1.0}) return prediction And now I define the model function, which is in fact the neural network that we will train afterwards. An important difference with respect to my previous MNIST example is that I added an additional regularization term to the cost function. I used L2 regularization to penalize the weights in all four layers. The biases were not penalized, as this is not necessary. The strength of this penalty is given by a beta constant set to 0.01.\nWhy use additional regularization? Because this allowed me to decrease the variance, i.e. the difference in performance of the model on the training set compared to the validation set. This produced my best submission in the competition.\ndef model(X_train, Y_train, X_test, Y_test, learning_rate=0.0001, num_epochs=1000, minibatch_size=64, print_cost=True): \u0026#34;\u0026#34;\u0026#34; Implements a four-layer tensorflow neural network: (LINEAR-\u0026gt;RELU)^3-\u0026gt;LINEAR-\u0026gt;SOFTMAX. Arguments: X_train -- training set, of shape (input size, number of training examples) Y_train -- training labels, of shape (output size, number of training examples) X_test -- validation set, of shape (input size, number of validation examples) Y_test -- validation labels, of shape (output size, number of validation examples) learning_rate -- learning rate of the optimization num_epochs -- number of epochs of the optimization loop minibatch_size -- size of a minibatch print_cost -- True to print the cost every 100 epochs Returns: parameters -- parameters learnt by the model. They can then be used to predict. 
\u0026#34;\u0026#34;\u0026#34; ops.reset_default_graph() # to be able to rerun the model without overwriting tf variables tf.set_random_seed(1) # to keep consistent results seed = 3 # to keep consistent results (n_x, m) = X_train.shape # (n_x: input size, m : number of examples in the train set) n_y = Y_train.shape[0] # n_y : output size costs = [] # To keep track of the cost t0 = time.time() # to mark the start of the training # Create Placeholders of shape (n_x, n_y) X, Y = create_placeholders(n_x, n_y) keep_prob1 = tf.placeholder(tf.float32, name=\u0026#39;keep_prob1\u0026#39;) # probability to keep a unit during dropout keep_prob2 = tf.placeholder(tf.float32, name=\u0026#39;keep_prob2\u0026#39;) # Initialize parameters parameters = initialize_parameters() # Forward propagation Z4 = forward_propagation(X, parameters, keep_prob1, keep_prob2) # Cost function cost = compute_cost(Z4, Y) regularizers = tf.nn.l2_loss(parameters[\u0026#39;W1\u0026#39;]) + tf.nn.l2_loss(parameters[\u0026#39;W2\u0026#39;]) + tf.nn.l2_loss(parameters[\u0026#39;W3\u0026#39;]) \\ + tf.nn.l2_loss(parameters[\u0026#39;W4\u0026#39;]) # add regularization term beta = 0.01 # regularization constant cost = tf.reduce_mean(cost + beta * regularizers) # cost with regularization # Backpropagation: Define the tensorflow AdamOptimizer. optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost) # Initialize all the variables init = tf.global_variables_initializer() # Start the session to compute the tensorflow graph with tf.Session() as sess: # Run the initialization sess.run(init) # Do the training loop for epoch in range(num_epochs): epoch_cost = 0. 
# Defines a cost related to an epoch num_minibatches = int(m / minibatch_size) # number of minibatches of size minibatch_size in the train set seed = seed + 1 minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed) for minibatch in minibatches: # Select a minibatch (minibatch_X, minibatch_Y) = minibatch # Run the session to execute the \u0026#34;optimizer\u0026#34; and the \u0026#34;cost\u0026#34; _, minibatch_cost = sess.run([optimizer, cost], feed_dict={X: minibatch_X, Y: minibatch_Y, keep_prob1: 0.7, keep_prob2: 0.5}) epoch_cost += minibatch_cost / num_minibatches # Print the cost every epoch if print_cost == True and epoch % 100 == 0: print(\u0026#34;Cost after epoch {}: {:f}\u0026#34;.format(epoch, epoch_cost)) if print_cost == True and epoch % 5 == 0: costs.append(epoch_cost) # lets save the parameters in a variable parameters = sess.run(parameters) print(\u0026#34;Parameters have been trained!\u0026#34;) # Calculate the correct predictions correct_prediction = tf.equal(tf.argmax(Z4), tf.argmax(Y)) # Calculate accuracy on the test set accuracy = tf.reduce_mean(tf.cast(correct_prediction, \u0026#34;float\u0026#34;)) train_cost = cost.eval({X: X_train, Y: Y_train, keep_prob1: 1.0, keep_prob2: 1.0}) test_cost = cost.eval({X: X_test, Y: Y_test, keep_prob1: 1.0, keep_prob2: 1.0}) train_accuracy = accuracy.eval({X: X_train, Y: Y_train, keep_prob1: 1.0, keep_prob2: 1.0}) test_accuracy = accuracy.eval({X: X_test, Y: Y_test, keep_prob1: 1.0, keep_prob2: 1.0}) print(\u0026#39;Finished training in %s s\u0026#39; % (time.time() - t0)) print(\u0026#34;Train Cost:\u0026#34;, train_cost) print(\u0026#34;Test Cost:\u0026#34;, test_cost) print(\u0026#34;Train Accuracy:\u0026#34;, train_accuracy) print(\u0026#34;Test Accuracy:\u0026#34;, test_accuracy) # plot the cost plt.plot(np.squeeze(costs)) plt.ylabel(\u0026#39;cost\u0026#39;) plt.xlabel(\u0026#39;iterations (per fives)\u0026#39;) plt.title(\u0026#34;Learning rate = {}, beta = {},\\n\u0026#34; 
\u0026#34;test cost = {:.6f}, test accuracy = {:.6f}\u0026#34;.format(learning_rate, beta, test_cost, test_accuracy)) global filename filename = timestr + \u0026#39;_NN4Lstage2_lr_{}_beta_{}_cost_{:.2f}-{:.2f}_acc_{:.2f}-{:.2f}\u0026#39;.format( learning_rate, beta, train_cost, test_cost, train_accuracy, test_accuracy) plt.savefig(dirname + filename + \u0026#39;.png\u0026#39;) return parameters Note that the model function will return the learned parameters from the network and additionally will plot the cost after each epoch. The plot is also saved as a file that includes the timestamp as well as the learning rate, beta, cost and accuracy information for this particular run.\nNow it\u0026rsquo;s time to train the model using the train and validation data:\n# train the model and get learned parameters parameters = model(X_train, Y_train, X_val, Y_val) Cost after epoch 0: 6.607861 Cost after epoch 100: 1.389869 Cost after epoch 200: 0.988806 Cost after epoch 300: 0.882713 Cost after epoch 400: 0.833693 Cost after epoch 500: 0.811457 Cost after epoch 600: 0.793379 Cost after epoch 700: 0.773927 Cost after epoch 800: 0.762247 Cost after epoch 900: 0.767449 Parameters have been trained! Finished training in 498.4203100204468 s Train Cost: 0.665462 Test Cost: 1.74987 Train Accuracy: 0.979292 Test Accuracy: 0.643609 From my validation results we can observe that the network learned nicely. However, the final cost on the training data was 0.665462, whereas the validation data had a final cost of 1.74987. This is a large difference and an indication that the model is overfitting. Moreover, the accuracy (defined here as the fraction of correct predictions) is very high (97.9%) for the training data and only 64.3% for the validation set. 
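The cost values above are cross-entropy values, the same quantity Kaggle scores as multi-class log loss. A minimal numpy sketch of that metric, using toy probabilities rather than real model output:

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    # probs: one row of class probabilities per sample; y_true: class indices
    p = np.clip(probs, eps, 1 - eps)
    p = p / p.sum(axis=1, keepdims=True)   # renormalize after clipping
    return float(-np.log(p[np.arange(len(y_true)), y_true]).mean())

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
print(round(multiclass_log_loss(np.array([0, 1]), probs), 4))  # 0.2899
```

A confident but wrong prediction is punished heavily by this metric, which is part of why an overfit model degrades so quickly on held-out data.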
This is another indication that the model is overfitting, even though I used both dropout and L2 regularization to counteract it.\nMake predictions We use the learned parameters to make a prediction on the test data.\n# use learned parameters to make prediction on test data prediction = predict(X_test, parameters) Let\u0026rsquo;s look at an example of a prediction. As we can see below, the prediction consists of the probabilities of the entry belonging to each of the nine different categories (this was the format needed for this competition).\nprediction[:,0] array([ 0.36503336, 0.21219006, 0.01297534, 0.14676626, 0.08375936, 0.09217557, 0.02737238, 0.03150512, 0.02822249], dtype=float32) prediction.shape (9, 986) All we have to do now is create a submission .csv file to save our prediction results.\n# create submission file submission = pd.DataFrame(prediction.T) submission[\u0026#39;id\u0026#39;] = test_index submission.columns = [\u0026#39;class1\u0026#39;, \u0026#39;class2\u0026#39;, \u0026#39;class3\u0026#39;, \u0026#39;class4\u0026#39;, \u0026#39;class5\u0026#39;, \u0026#39;class6\u0026#39;, \u0026#39;class7\u0026#39;, \u0026#39;class8\u0026#39;, \u0026#39;class9\u0026#39;, \u0026#39;id\u0026#39;] submission.to_csv(dirname + filename + \u0026#39;.csv\u0026#39;, index=False) Results interpretation Using this neural network model, my submission to Kaggle yielded the following results:\nPublic score (based on a portion of the test data, used by Kaggle to provide an indication of performance during the competition): Loss = 1.69148 Private score (based on a different portion of the test data, used by Kaggle to provide the final score at the end of the competition): Loss = 2.74500 The discrepancy between these two scores further shows that overfitting is an issue in working with this data in a neural network model. 
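The L2 term added to the cost can be restated outside TensorFlow. The sketch below assumes only that tf.nn.l2_loss(W) computes half the sum of squared entries; the weight matrices here are toy stand-ins, not the trained W1..W4:

```python
import numpy as np

def l2_penalty(weights, beta=0.01):
    # beta * sum_i tf.nn.l2_loss(W_i): half the sum of squared entries per matrix
    return beta * sum((w ** 2).sum() / 2.0 for w in weights)

# Toy matrices standing in for the layer weights (biases are excluded)
Ws = [np.ones((2, 3)), 2 * np.ones((1, 2))]
print(round(l2_penalty(Ws), 6))   # 0.01 * (6/2 + 8/2) = 0.07
```

Raising beta shrinks the weights harder, trading a higher training cost for a smaller train/validation gap, at the risk of underfitting.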
My model could benefit from more training data and stronger regularization.\n","permalink":"https://arcosdiaz.com/posts/2017-10-07-personalized-medicine/","summary":"\u003cp\u003eThis notebook describes my approach to the \u003ca href=\"https://www.kaggle.com/c/msk-redefining-cancer-treatment\"\u003eKaggle competition\u003c/a\u003e named in the title. This was a research competition at Kaggle in cooperation with the Memorial Sloan Kettering Cancer Center (MSKCC).\u003c/p\u003e\n\u003cp\u003eThe goal of the competition was to create a machine learning algorithm that can classify genetic variations that are present in cancer cells.\u003c/p\u003e\n\u003cp\u003eTumors contain cells with many different abnormal mutations in their DNA: some of these mutations are the drivers of tumor growth, whereas others are neutral and considered \u003cem\u003epassengers\u003c/em\u003e. Normally, mutations are manually classified into different categories after literature review by clinicians. The dataset made available for this competition contains mutations that have been manually annotated into 9 different categories. The goal is to predict the correct category of mutations in the test set.\u003c/p\u003e","title":"Personalized Medicine Kaggle Competition"},{"content":"Health care systems world-wide are under pressure due to the high costs associated with disease. Now more than ever, particularly in developed countries, we have access to the latest advancements in medicine. This contrasts with the challenge of making those treatments available to as many patients as possible. It is imperative to find ways to maximize the positive impact on the quality of life of patients, while maintaining a sustainable health care system. For this purpose, I performed an analysis of Medicare data in the USA. Furthermore, I used a drug-disease open database to cluster the costs by disease. 
I identified the most expensive diseases (mostly chronic diseases such as diabetes) and the most expensive medicines. A drug for the treatment of HCV infections (Harvoni) stands out with the highest total costs in 2015. After this first exploration, I propose the in-depth analysis of further data to enable more targeted conclusions and recommendations to improve health care, such as linking price databases to compare drug costs for similar indications, or analyzing population data registers that document lifestyle characteristics of healthy and sick individuals to identify those at risk of developing high-cost diseases.\nRelevance Health care costs amount to a considerable part of national budgets all over the world. In 2015, $3.2 trillion were spent on health care in the USA (17.8% of its GDP). In Germany, health care spending reached 11.3% of GDP in 2014. On the one hand, these high health care costs can be explained by population growth, particularly in the elderly proportion, which requires higher investments to secure quality of life. On the other hand, new medicines are continuously being discovered, enabling the treatment of diseases that were once a death sentence. As a consequence, many once-fatal diseases have become chronic, placing a high burden on health care costs.\nBut how can governments and insurers make sure that patients receive the care they need, including the latest technological advances, without bankrupting the system? A first step is the identification of high-cost diseases and drugs. 
These insights can then be used to identify population segments at high risk of developing a disease, who can then be the focus of prevention measures.\nGovernments, insurers, patient organizations, and pharmaceutical and biotech companies all need to leverage their available data if we are to improve the health of patients now and in future generations.\nMethods Data sources Medicare Drug Spending Data 2011-2015: drug spending and utilization data. In this analysis only Medicare Part D drugs were considered (drugs patients generally administer themselves) Therapeutic Targets Database: Drug-to-disease mapping with ICD identifiers. Tools pandas for data crunching fuzzywuzzy for fuzzy string matching git for version control Data preprocessing First, I cleaned up and processed the drug spending data available from Medicare for the years 2011-2015. This data includes the total spending, claim count, and beneficiary count, among others, for each drug identified by its brand and generic names.\nimport numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns sns.set_palette(\u0026#39;Paired\u0026#39;) sns.set_style(\u0026#39;whitegrid\u0026#39;) %matplotlib inline import warnings warnings.filterwarnings(\u0026#39;ignore\u0026#39;) data = pd.read_csv(\u0026#39;data/medicare_data_disease.csv\u0026#39;) data.head() I also processed the data from the Therapeutic Targets Database, which lists the indications (diseases) associated with each drug generic name.\ndiseases = pd.read_csv(\u0026#39;data/drug-disease_keys.csv\u0026#39;) diseases.head() Then, I used a fuzzy string matching algorithm to match each drug generic name in the Medicare data with the closest element from the Therapeutic Targets Database. After having a list of exact matches, I assigned the first associated indication to each Medicare drug. 
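The matching step can be sketched as follows. This is a minimal illustration using the standard library's difflib as a stand-in for fuzzywuzzy, with a made-up handful of drug names rather than the real datasets:

```python
import difflib

# Hypothetical mini-lists standing in for the two real datasets
medicare_names = ["lisinopril", "atorvastatin calcium", "harvoni"]
ttd_names = ["Lisinopril", "Atorvastatin", "Harvoni"]

matches = {}
for name in medicare_names:
    # best close match above a similarity cutoff, compared case-insensitively
    hit = difflib.get_close_matches(name.lower(),
                                    [t.lower() for t in ttd_names],
                                    n=1, cutoff=0.7)
    if hit:
        matches[name] = hit[0]
```

fuzzywuzzy works the same way in spirit (a best-scoring candidate above a threshold) but scores with Levenshtein-based ratios instead of difflib's sequence matcher.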
For details on how I did this, please check my GitHub repository.\nResults Figure 1: Most expensive drugs and indications by total spending in a 5-year interval spending = data.groupby(\u0026#39;Indication\u0026#39;).sum().sort_values(by=\u0026#39;Total Spending\u0026#39;, ascending=False) spending.head() spending_drug = data.groupby(\u0026#39;Brand Name\u0026#39;).sum().sort_values(by=\u0026#39;Total Spending\u0026#39;, ascending=False) spending_drug.head() n_top = 40 fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=(8,8)) g = sns.barplot(x=\u0026#39;Total Spending\u0026#39;, y=\u0026#39;Indication\u0026#39;, data=spending.reset_index()[:n_top], estimator=np.sum, ax=ax1, color=sns.xkcd_rgb[\u0026#39;dodger blue\u0026#39;]) g.set(yticklabels=[i[:27] for i in spending[:n_top].index]) g.set_xlabel(\u0026#39;Total Spending $\u0026#39;) g2 = sns.barplot(x=\u0026#39;Total Spending\u0026#39;, y=\u0026#39;Brand Name\u0026#39;, data=spending_drug.reset_index()[:n_top], estimator=np.sum, ax=ax2, color=\u0026#39;lightblue\u0026#39;) g2.set(yticklabels=[i[:20] for i in spending_drug[:n_top].index]) g2.set_xlabel(\u0026#39;Total Spending $\u0026#39;) #plt.title(\u0026#39;Top 50 indications by Beneficiary Count Sum from 2011 to 2015\u0026#39;) fig.suptitle(\u0026#39;Top %s indications and drugs for 5-year total spending 2011-2015\u0026#39; %n_top, size=16) plt.tight_layout() fig.subplots_adjust(top=0.94) plt.savefig(\u0026#39;Top_%s_disease_drug.png\u0026#39; %n_top, dpi=300, bbox_inches=\u0026#39;tight\u0026#39;) Indications (left part) A look at the total spending for the 5-year period 2011-2015 reveals that the bulk of drug spending is covered by a small set of diseases/indications (left graph). The total spending per indication decreases rapidly going down the list.\nDiabetes occupies the first place in this list, with a total 5-year spending exceeding $50 billion. 
Following in the list, we find other chronic diseases such as schizophrenia, chronic obstructive pulmonary disease, hypertension (high blood pressure), hypercholesterolemia (high cholesterol), depression, HIV infections, multiple sclerosis, peptic ulcer disease, and chronic HCV infection (hepatitis C). Interestingly, pain medications are also in the top 4 indications by total spending.\nIt makes sense that treatment of chronic diseases receives the highest investment in drug spending, as patients with these diseases can live long lives when medicated.\nNotably, the first cancer appears only in 19th place on this list (chronic myelogenous leukemia). However, it must be noted that cancer is actually a collection of different diseases with different genetics, origins, and treatment options. These different cancers were not clustered in this analysis.\nDrugs (right part) When we look at the most expensive drugs by total 5-year spending, we find at the top of the list Lantus (insulin), Nexium (peptic ulcer), and Crestor (anti-cholesterol). This makes sense, as these are medicines used to treat chronic diseases.\nHowever, we cannot learn much on a high level from looking at the total spending only. 
Therefore, a closer look is needed.\nFigure 2: Drug spending is growing but at very heterogeneous rates spend_2015_ind = data[data[\u0026#39;Year\u0026#39;]==2015].groupby(\u0026#39;Indication\u0026#39;).sum().sort_values(by=\u0026#39;Total Spending\u0026#39;, ascending=False) #spend_2015_drug = data[data[\u0026#39;Year\u0026#39;]==2015].groupby(\u0026#39;Brand Name\u0026#39;).sum().sort_values(by=\u0026#39;Total Spending\u0026#39;, # ascending=False) spend_2015_ind.head() top_10_spend = data[data[\u0026#39;Year\u0026#39;]==2015].sort_values(by=\u0026#39;Total Spending\u0026#39;, ascending=False)[[\u0026#39;Brand Name\u0026#39;, \u0026#39;Total Spending\u0026#39;, \u0026#39;Year\u0026#39;]][:10] top_10_spend fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=False, figsize=(8,5)) g=sns.factorplot(x=\u0026#39;Year\u0026#39;, y=\u0026#39;Total Spending\u0026#39;, hue=\u0026#39;Brand Name\u0026#39;, palette=\u0026#39;coolwarm\u0026#39;, hue_order=top_10_spend[\u0026#39;Brand Name\u0026#39;], data=data[data[\u0026#39;Brand Name\u0026#39;].isin(top_10_spend[\u0026#39;Brand Name\u0026#39;])], ax=ax1) ax1.set_title(\u0026#39;Annual spending for top 10 drugs\u0026#39;) ax1.set_ylabel(\u0026#39;Total Spending $\u0026#39;) plt.close(g.fig) ax2.scatter(x=spend_2015_ind[\u0026#39;Beneficiary Count\u0026#39;][:100], y=spend_2015_ind[\u0026#39;Total Spending\u0026#39;][:100], s=spend_2015_ind[\u0026#39;Claim Count\u0026#39;][:100]/100000, #c=spend_2015_ind.reset_index()[\u0026#39;Indication\u0026#39;][:100]) color=sns.xkcd_rgb[\u0026#39;dodger blue\u0026#39;], alpha=0.75) ax2.set_title(\u0026#39;Top 100 indications in 2015\u0026#39;) plt.xlabel(\u0026#39;Beneficiary Count\u0026#39;) plt.ylabel(\u0026#39;Total Spending $\u0026#39;) plt.axis([0, None, 0, None]) for label, x, y in zip(spend_2015_ind.index, spend_2015_ind[\u0026#39;Beneficiary Count\u0026#39;][:10], spend_2015_ind[\u0026#39;Total Spending\u0026#39;][:10]): plt.annotate(label, xy=(x, y), color=\u0026#39;red\u0026#39;, 
alpha=0.75) fig.suptitle(\u0026#39;Annual drug spending development and overview of highest-cost indications\u0026#39;, size=16) plt.tight_layout() fig.subplots_adjust(top=0.85) plt.savefig(\u0026#39;Top_bubble_disease_drug.png\u0026#39;, dpi=300, bbox_inches=\u0026#39;tight\u0026#39;) Annual spending development for top 10 drugs (left) The drug landscape is not temporally static. Therefore, I analyzed the annual spending since 2011 for the top 10 drugs of 2015. Eight out of these ten drugs consistently received higher spending every year, a reflection of the general health care spending panorama. However, the rate of growth for each drug is dramatically different. Particularly striking is the case of the drug Harvoni, which exhibited a \u0026gt;7-fold growth in total spending between 2014 and 2015.\nHarvoni is a medicine for the treatment of hepatitis C (HCV infection) that was launched in 2014. It is the first drug with cure rates close to 100%. Harvoni practically cures a chronic disease and this is reflected in its pricing at over $90k for a 12-week treatment.\nThe remaining drugs in the figure are mostly used for the treatment of chronic diseases.\nBut how can we more extensively evaluate the burden posed by the different diseases/indications?\nTop 100 indications in 2015 (right) In order to find out more about the distribution of the most expensive indications, I plotted the drug spending grouped by indication for the year 2015 in a scatter plot. This way, we can not only look at the total spending but also at the number of beneficiaries for a particular indication. The size of the bubbles represents the relative number of claims.\nFrom this graph we can assess how strongly the most expensive diseases affect society. Diabetes is not only the most expensive single indication by total spending but also affects a very large number of people.\nThe indications with the most beneficiaries are hypertension, pain, and high cholesterol. 
They also account for some of the highest claim counts (bubble size). This indicates that the average cost associated with each claim is low, as these are generally medications with expired patents that are priced very low.\nAgain, it is interesting to take a look at chronic HCV infection, the indication for the drug Harvoni. Both the number of beneficiaries and the number of claims are extremely low compared with other diseases. However, chronic HCV infection had the second-highest total drug spending in 2015.\nNext steps I have shown in this analysis that very interesting insights can be gained from analyzing a small set of publicly available data. It follows that a more detailed and deeper analysis could enable more targeted conclusions and recommendations for improving the health care system and the quality of life of patients suffering from disease.\nAccess to non-public data would make even deeper analysis possible.\nAdditional analysis could include:\nClustering of diseases/indications into higher-level categories (cancer, metabolic disease, circulatory disease, nervous system disease, etc.) Linking of price databases to compare drug costs for the same indication on a population level Analysis of population data registers that document lifestyle characteristics of healthy and ill individuals to identify those at risk of developing high-cost diseases (e.g. Medical Expenditure Panel Survey, Behavioral Risk Factor Surveillance System data) Limitations One limitation of this analysis is that only Part D drugs were considered. A further analysis could include Part B drugs too.\nMoreover, it was assumed that the fuzzy matching was successful in most cases. A more detailed test is required to assess match success more stringently.\nAll conclusions are only valid for the 2011-2015 interval. 
No data for 2016 was analyzed.\n","permalink":"https://arcosdiaz.com/archive/2017-02-06-medicare-drug-cost/","summary":"\u003cp\u003eHealth care systems world-wide are under pressure due to the high costs associated with disease. Now more than ever, particularly in developed countries, we have access to the latest advancements in medicine. This contrasts with the challenge of making those treatments available to as many patients as possible. It is imperative to find ways maximize the positive impact on the quality of life of patients, while maintaining a sustainable health care system. For this purpose I performed an analysis of Medicare data in the USA. Furthermore I used a drug-disease open database to cluster the costs by disease. I identified the most expensive diseases (mostly chronic diseases such as Diabetes) and the most expensive medicines. A drug for the treatment of HCV infections (Harvoni) stands out with the highest total costs in 2015. After this first exploration, I propose the in-depth analysis of further data to enable more targeted conclusions and recommendations to improve health care, such as linking of price databases to compare drug costs for the similar indications or the analysis of population data registers that document life style characteristics of healthy and sick individuals to identify those at risk of developing high-cost diseases.\u003c/p\u003e","title":"Exploratory analysis of Medicare drug cost data 2011-2015"},{"content":"Do movie releases produce literal earthquakes? We always hear about new movie releases being a \u0026ldquo;blast\u0026rdquo;, some sure are. But how do two independent events correlate with each other? 
In this post, I will use Python to visualize two different series of events, plotting them on top of each other to gain insights from time series data.\n# Imports from datetime import datetime import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns sns.set_palette(\u0026#39;Set2\u0026#39;) sns.set_style(\u0026#34;whitegrid\u0026#34;) %matplotlib inline Getting the data To make this example more fun, I decided to use two independent series of events for which data is readily available on the internet:\nList of earthquakes around the world List of film releases in the USA Clean and prepare earthquake data We start by downloading the .csv export from the world earthquake website to the \u0026lsquo;data\u0026rsquo; directory and reading the file into a pandas DataFrame.\ndf = pd.read_csv(\u0026#39;data/earthquakes_raw.csv\u0026#39;, sep=\u0026#39;;\u0026#39;) df.dropna((0,1), how=\u0026#39;all\u0026#39;, inplace=True) df.head() We have to unify the date information from the Date and Year columns. Then we can save the cleaned-up earthquake date data to a file \u0026lsquo;data/earthquakes.csv\u0026rsquo;.\ndf[\u0026#39;Date\u0026#39;] = df[\u0026#39;Date\u0026#39;] + \u0026#39; \u0026#39; + df[\u0026#39;Year\u0026#39;].map(str) del df[\u0026#39;Year\u0026#39;] df[\u0026#39;Date\u0026#39;] = df[\u0026#39;Date\u0026#39;].apply(lambda x: datetime.strptime(x, \u0026#39;%B %d %Y\u0026#39;)) df[\u0026#39;Date\u0026#39;].to_csv(\u0026#39;data/earthquakes.csv\u0026#39;) df.head() Clean and prepare movie release data The movie release data was retrieved from this website and saved to the \u0026lsquo;data\u0026rsquo; directory. We then read the file into a pandas DataFrame. 
The resulting table tells us the release date and which movies were released on that date (up to 5 movies).\ndf = pd.read_csv(\u0026#39;data/filmrelease_raw.csv\u0026#39;, sep=\u0026#39;;\u0026#39;, header=None) df.dropna((0,1), how=\u0026#39;all\u0026#39;, inplace=True, thresh=2) df.columns = [\u0026#39;Date\u0026#39;, \u0026#39;Film1\u0026#39;, \u0026#39;Film2\u0026#39;, \u0026#39;Film3\u0026#39;, \u0026#39;Film4\u0026#39;, \u0026#39;Film5\u0026#39;] df.head(), df.tail() ( Date Film1 Film2 \\ 0 Friday, January 9, 2015 Taken 3 NaN 4 Friday, January 16, 2015 Blackhat Paddington 14 Friday, January 23, 2015 Mortdecai Strange Magic 24 Friday, January 30, 2015 Black or White Project Almanac 34 Friday, February 6, 2015 Jupiter Ascending Seventh Son Film3 Film4 Film5 0 NaN NaN NaN 4 The Wedding Ringer NaN NaN 14 The Boy Next Door NaN NaN 24 The Loft NaN NaN 34 SpongeBob Movie: Sponge Out of Water NaN NaN , Date Film1 Film2 \\ 836 Friday, December 9 🎥 Office Christmas Party NaN 840 Friday, December 16 🎥 Collateral Beauty 🎥 La La Land 850 Wednesday, December 21 🎥 Assassin's Creed 🎥 Passengers 863 Friday, December 23 🎥 Why Him? NaN 867 Sunday, December 25 🎥 Fences NaN Film3 Film4 Film5 836 NaN NaN NaN 840 🎥 Rogue One: A Star Wars Story NaN NaN 850 🎥 Patriots Day 🎥 Sing NaN 863 NaN NaN NaN 867 NaN NaN NaN ) Talk about raw unclean data! It seems that, at the top of the table, the date information contains the year (2015). However, upon further inspection we can see that the bottom of the table does not show the year anymore. From the website information we find out that, from index 716 onwards, the missing year is \u0026lsquo;2016\u0026rsquo;. 
So we add this data to the DataFrame and change the date format to a more readable one.\ndf.loc[lambda x: x.index \u0026gt;= 716, \u0026#39;Date\u0026#39;] += \u0026#39;, 2016\u0026#39; df[\u0026#39;Date\u0026#39;] = df[\u0026#39;Date\u0026#39;].apply(lambda x: datetime.strptime(x, \u0026#39;%A, %B %d, %Y\u0026#39;)) df.head() For the purpose of plotting the frequency of an event, we are not interested in what movies were released, but only in how many were released on a particular date. We can count the movies by replacing the names with ones and calculating the sum.\n# replace movie names with ones df.iloc[:,1:] = df.iloc[:,1:].replace(r\u0026#39;\\w\u0026#39;, 1.0, regex=True) df.head() We can finally get rid of the unnecessary columns and save the clean data to a file. Now we are ready to start plotting our event series data.\ndf[\u0026#39;film_sum\u0026#39;] = df[[\u0026#39;Film1\u0026#39;, \u0026#39;Film2\u0026#39;, \u0026#39;Film3\u0026#39;, \u0026#39;Film4\u0026#39;, \u0026#39;Film5\u0026#39;]].sum(axis=1) df.drop([\u0026#39;Film1\u0026#39;, \u0026#39;Film2\u0026#39;, \u0026#39;Film3\u0026#39;, \u0026#39;Film4\u0026#39;, \u0026#39;Film5\u0026#39;], axis=1, inplace=True) df.to_csv(\u0026#39;data/filmrelease.csv\u0026#39;) df.head() Load the data to the plotting variables To use this script, we have to load the clean data that we saved in the previous steps.\n# Load the earthquake data and add a column with ones since there was only one earthquake per row df1 = pd.read_csv(\u0026#39;data/earthquakes.csv\u0026#39;, header=None) del df1[0] df1.columns = [\u0026#39;Date\u0026#39;] df1[\u0026#39;earthquake\u0026#39;] = np.ones(len(df1)) df1.head() # Load the movie data, the second column already shows us the sum of movie releases df2 = pd.read_csv(\u0026#39;data/filmrelease.csv\u0026#39;, header=0) del df2[\u0026#39;Unnamed: 0\u0026#39;] df2.columns = [\u0026#39;Date\u0026#39;, \u0026#39;movie_release\u0026#39;] df2.head() As an end result, we want a single DataFrame containing the data for both event series. Moreover, we want a continuous time series, including those days on which neither of the two events took place (no earthquakes and no movie releases). We do this by using the concat and resample functions of pandas.\n# Concatenate both DataFrames into one df = pd.concat([df1, df2], ignore_index=True) df = df.set_index(pd.DatetimeIndex(df.Date)) df = df.sort_index() df = df.resample(\u0026#39;1d\u0026#39;).sum().fillna(0) # to complete every day df.head(10) Calculating and plotting a moving average We could simply plot each event occurrence as a data point in a time series. However, this would likely yield a graph that is not very informative. Much easier to grasp is a moving average that tells us the average frequency of the events over a defined period of time in the past. We can create columns for these moving averages, which we can then easily plot.\n# Calculate moving average for i in [7*4, 7*4*2]: mvav = i # moving average period, i.e.
number of points to average dfi = np.convolve(df[\u0026#39;earthquake\u0026#39;], np.ones((mvav,))*7/mvav # factor for obtaining average , mode=\u0026#39;full\u0026#39;) df[\u0026#39;earthquake moving average %sw\u0026#39; % (int(i/7))] = dfi[:-(i-1)] dfj = np.convolve(df[\u0026#39;movie_release\u0026#39;], np.ones((mvav,))*7/mvav # factor for obtaining average , mode=\u0026#39;full\u0026#39;) df[\u0026#39;movie_release moving average %sw\u0026#39; % (int(i/7))] = dfj[:-(i-1)] df.head() The shorter the period that we choose for the moving average, the noisier our graph will get. Let\u0026rsquo;s settle on a moving average that reflects the frequency during the past 8 weeks. And voilà! Now we can see how the frequency of earthquakes and movie releases changed over time.\n# Plot relevant columns from dataframe df.loc[:,[\u0026#39;earthquake moving average 8w\u0026#39;, \u0026#39;movie_release moving average 8w\u0026#39;]].\\ plot(cmap=\u0026#39;Set2\u0026#39;, figsize=(12,4)) plt.xlim(df.index[0], df.index.max()+10) plt.title(\u0026#39;Moving average of events per week\u0026#39;) plt.ylabel(\u0026#39;Frequency\u0026#39;) plt.show() Descriptive analysis of the event occurrence What other insights can we get from this data set? Two very dissimilar series of events, one natural and one man-made, will surely have very different properties. 
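As an aside, the convolution-based moving average used above can also be computed with pandas' rolling-window machinery. A minimal sketch on a toy daily series (not the post's actual data):

```python
import numpy as np
import pandas as pd

# Toy daily series: one event every day for four weeks
idx = pd.date_range("2015-01-01", periods=28, freq="D")
toy = pd.DataFrame({"earthquake": np.ones(28)}, index=idx)

# Mean events per day over a 28-day window, scaled by 7 to events per week
toy["earthquake moving average 4w"] = toy["earthquake"].rolling(28).mean() * 7
```

With one event per day, the 4-week average comes out at 7 events per week once the window is full; the first 27 rows are NaN because the window is incomplete, which plays the same role as trimming the convolution output above.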
Let\u0026rsquo;s start with a simple question: on which days of the week do both events typically happen?\n#%% DAY OF THE WEEK ANALYSIS # create column for day of the week df[\u0026#39;Day\u0026#39;] = df.index.dayofweek df[\u0026#39;Day\u0026#39;] = df.Day.astype(\u0026#39;category\u0026#39;) df.Day.cat.categories = [\u0026#39;Mon\u0026#39;,\u0026#39;Tue\u0026#39;, \u0026#39;Wed\u0026#39;, \u0026#39;Thu\u0026#39;, \u0026#39;Fri\u0026#39;, \u0026#39;Sat\u0026#39;, \u0026#39;Sun\u0026#39;] # create column for type df[\u0026#39;Type\u0026#39;] = np.where(df[\u0026#39;earthquake\u0026#39;]\u0026gt;0, \u0026#39;earthquake\u0026#39;, np.where(df[\u0026#39;movie_release\u0026#39;]\u0026gt;0,\\ \u0026#39;movie_release\u0026#39;, np.nan)) df[\u0026#39;Type\u0026#39;] = df[\u0026#39;Type\u0026#39;].astype(\u0026#39;category\u0026#39;) df[\u0026#39;Type\u0026#39;] = df[\u0026#39;Type\u0026#39;].cat.remove_categories([\u0026#39;nan\u0026#39;]) # plot count data per day of the week plt.figure() plt.title(\u0026#39;Event count per day of the week\u0026#39;) sns.countplot(data=df, x=\u0026#39;Day\u0026#39;, hue=\u0026#39;Type\u0026#39;) sns.despine(left=True) plt.show() We can see that nature does not respect our weekends, as earthquakes seem to be evenly distributed across the days of the week.\nThe movie releases, on the other hand, are most frequent on Friday, followed by Wednesday and a few on Sunday (it\u0026rsquo;s almost as if movie release days were chosen by someone\u0026hellip; /s). It seems that, if you\u0026rsquo;re planning to release a new movie in the US, Friday is the way to go. People are usually happy to start the weekend with a leisure activity, so that makes sense. It would be interesting to find out why Saturdays and Sundays are almost never used as release days, even though people are also usually free from work on those days. 
Another intriguing finding is the small but notable number of releases on Wednesdays. Don\u0026rsquo;t people work on Thursdays?\nAnalysis of frequency per week The world is a big place (or is it a small world?) and earthquakes occur all the time, even though we might not always find out. On the other hand, I would expect movie releases to occur much more frequently. So let\u0026rsquo;s take a look at the data by plotting histograms for both events side by side.\n# joint histograms plt.figure() df[\u0026#39;earthquake moving average 8w\u0026#39;].hist(alpha=.9) df[\u0026#39;movie_release moving average 8w\u0026#39;].hist(alpha=.9) plt.title(\u0026#39;Histogram of event frequency per week\u0026#39;) plt.show() # With seaborn sns.distplot(df[\u0026#39;earthquake moving average 8w\u0026#39;]), \\ sns.distplot(df[\u0026#39;movie_release moving average 8w\u0026#39;]) Luckily, movie releases are much more frequent per week than earthquakes. On most weeks, there are between two and three movie releases, compared to 0.5 to 1.5 earthquakes.\nFinal remarks In this post, we gathered information on the occurrence of two kinds of events: earthquakes around the world, and movie releases in the US. By plotting their moving averages, we could better compare when they occurred and gained some interesting insights along the way. All thanks to Python! 
In this post, I will use Python to visualize two different series of events, plotting them on top of each other to gain insights from time series data.\u003c/p\u003e\n\u003cdiv class=\"highlight\"\u003e\u003cpre tabindex=\"0\" style=\"color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;\"\u003e\u003ccode class=\"language-python\" data-lang=\"python\"\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#75715e\"\u003e# Imports\u003c/span\u003e\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003efrom\u003c/span\u003e datetime \u003cspan style=\"color:#f92672\"\u003eimport\u003c/span\u003e datetime\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003eimport\u003c/span\u003e numpy \u003cspan style=\"color:#66d9ef\"\u003eas\u003c/span\u003e np\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003eimport\u003c/span\u003e pandas \u003cspan style=\"color:#66d9ef\"\u003eas\u003c/span\u003e pd\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003eimport\u003c/span\u003e matplotlib.pyplot \u003cspan style=\"color:#66d9ef\"\u003eas\u003c/span\u003e plt\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003eimport\u003c/span\u003e seaborn \u003cspan style=\"color:#66d9ef\"\u003eas\u003c/span\u003e sns\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003esns\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003eset_palette(\u003cspan style=\"color:#e6db74\"\u003e\u0026#39;Set2\u0026#39;\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan 
style=\"display:flex;\"\u003e\u003cspan\u003esns\u003cspan style=\"color:#f92672\"\u003e.\u003c/span\u003eset_style(\u003cspan style=\"color:#e6db74\"\u003e\u0026#34;whitegrid\u0026#34;\u003c/span\u003e)\n\u003c/span\u003e\u003c/span\u003e\u003cspan style=\"display:flex;\"\u003e\u003cspan\u003e\u003cspan style=\"color:#f92672\"\u003e%\u003c/span\u003ematplotlib inline\n\u003c/span\u003e\u003c/span\u003e\u003c/code\u003e\u003c/pre\u003e\u003c/div\u003e\u003ch1 id=\"getting-the-data\"\u003eGetting the data\u003c/h1\u003e\n\u003cp\u003eTo make this example more fun, I decided to use two independent series of events for which data is readily available in the internet:\u003c/p\u003e","title":"Visualizing parallel event series in Python"},{"content":"Being able to see the future would be a great superpower (or so one would think). Luckily, it is already possible to model the future using Python to gain insights into a number of problems from many different areas. In marketing, being able to model how successful a new product will be, would be of great use. In this post, I will take a look at how we can model the future revenue of a product by making certain assumptions and running a Monte Carlo Markov Chain simulation.\nWhat are Monte Carlo methods? Wikipedia tells us that:\nMonte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. Their essential idea is using randomness to solve problems that might be deterministic in principle.\nIn simple terms, we define a number of rules about how a system will behave based on assumptions, and then use random samplings of these conditions over and over and measure the results. 
We can then look at the results altogether to gain insights into our model.\nLet\u0026rsquo;s see this in practice!\nimport numpy as np from pylab import triangular, zeros, percentile from scipy.stats import binom import pandas as pd import seaborn as sns sns.set_palette(\u0026#39;coolwarm\u0026#39;) sns.set_style(\u0026#34;whitegrid\u0026#34;) import matplotlib.pyplot as plt %matplotlib inline Define the initial assumptions What assumptions can we safely make regarding our new product? For example, what is the estimated market size that we want to work with and what is the estimated price that we can expect? We also define num_timesteps, the number of years for which we want to make the calculations.\n# initial market size assumption (total number of potential users) marketsize_min = 5000 marketsize_mode = 12000 marketsize_max = 15000 marketshare_init = triangular(.003, .005, .01) # min, mode, max # initial percentage of users that use the product price_min=500 # minimum product price price_mode=1000 # mode product price price_max=1500 # maximum product price num_timesteps=10 # number of years for the simulation num_simulations=1024 # number of simulations per year perc_selection = [5, 25, 50, 75, 95] # percentiles to visualize in plots Define the functions to calculate market share and revenue of a product These are the functions used to get the data points by random sampling. 
Each time we run each function, the variables are randomly defined from a range and a result is obtained, e.g. a market share or a revenue amount.\ndef calc_marketshare(marketsize, marketshare): \u0026#39;\u0026#39;\u0026#39; Calculates product market share for a given year as percentage of users that use the product compared to total number of users Arguments: marketsize : total market size as number of potential users marketshare: observed/assumed percentage of users that use the product \u0026#39;\u0026#39;\u0026#39; share = binom.rvs(marketsize, marketshare, size=1) / marketsize return share def calc_revenue(marketsize, marketshare): \u0026#39;\u0026#39;\u0026#39; Calculates the revenue development over a number of years Arguments: marketsize: total market size as number of potential users marketshare : observed/assumed percentage of users that use the product \u0026#39;\u0026#39;\u0026#39; product_price = triangular(price_min, price_mode, price_max) volume = marketsize*marketshare revenue = product_price * volume return revenue Additionally, if a distribution is not included in Python\u0026rsquo;s standard statistical modules, we can write it ourselves. For example, we can define functions to return logistic and sigmoid distributions.\ndef logist(x, loc, scale, factor): \u0026#39;\u0026#39;\u0026#39; Logistic distribution Args: x : variable in x-axis, e.g. time loc : the mean of the distribution, maximum probability scale : steepness of the curve, higher -\u0026gt; steeper factor : multiplies to obtain higher probabilities overall \u0026#39;\u0026#39;\u0026#39; return factor*np.exp((loc-x)/scale)/(scale*(1+np.exp((loc-x)/scale))**2) def sigmoid(x): L, q, loc = 10, 1, 3 return L/(1+np.exp(-q*(x-loc))) Why do we need this logistic distribution? For example, if we want to take into account the market growth in the next ten years, we could simply assume it will be 1% or 2% or 10% and keep it constant. 
However, we have Python on our side and can instead model this growth in a semi-random way. We assume that the market growth is more likely to be low (between 0 and 4%) but also want to consider the lower-probability cases in which the growth could be higher, e.g. 8%.\ndef logist_test(x): loc, scale = 2, 2 return 4*np.exp((loc-x)/scale)/(scale*(1+np.exp((loc-x)/scale))**2) x = np.arange(0,10) plt.plot(logist_test(x)) #plt.plot(bins, logist(bins, loc, scale)*count.max()/logist(bins, loc, scale).max()) plt.show() Data collection and simulation Now that we have all assumptions and \u0026ldquo;rules\u0026rdquo; in place, let\u0026rsquo;s get some data points.\nFirst, let\u0026rsquo;s create some empty matrices to hold the data.\nu = zeros((num_simulations,), dtype=float) # temporary market size matrix as number of potential users s = zeros((num_simulations,), dtype=float) # temporary market share matrix r = zeros((num_simulations,), dtype=float) # temporary revenue matrix rev = zeros((num_timesteps, num_simulations), dtype=float) # revenue data collection by year percentiles_rev = zeros((num_timesteps,len(perc_selection)), dtype=float) # percentiles_rev data collection by year usr = zeros((num_timesteps, num_simulations), dtype=float) # users data collection by year percentiles_usr = zeros((num_timesteps,len(perc_selection)), dtype=float) # percentiles for total users sha = zeros((num_timesteps, num_simulations), dtype=float) # market share data collection by year percentiles_sha = zeros((num_timesteps,len(perc_selection)), dtype=float) # percentiles for market share Now we can run the simulations to get our data points for the next 10 years. 
The results are captured in the pre-created matrices.\nfor t in range(0, num_timesteps): if t==0: # First year starting with initial assumptions for k in range(num_simulations): u[k] = triangular(marketsize_min,marketsize_mode,marketsize_max) # triangular distribution of current number of potential users s[k] = calc_marketshare(u[k], marketshare_init) # market share for product r[k] = calc_revenue(u[k], s[k]) # revenue # store values in first row of matrices: rev[t,:] += r usr[t,:] += u sha[t,:] = s #percentiles of the complete revenue row at time t percentiles_rev[t,:] = percentile(rev[t,:], perc_selection) percentiles_usr[t,:] = percentile(usr[t,:], perc_selection) percentiles_sha[t,:] = percentile(sha[t,:], perc_selection) else: # Following years starting with the previous year\u0026#39;s data for k in range(num_simulations): # estimate how much the market has grown: loc = triangular(1, 2, 4) scale = triangular(1, 2, 3) factor = 3 marketgrowth = logist(t, loc, scale, factor) u[k] += u[k] * marketgrowth # apply market growth s[k] = calc_marketshare(u[k], s[k]) + logist(t, 4, 5, 1) # apply market share increase r[k] = calc_revenue(u[k], s[k]) # calculate revenue # store values in following rows of matrices rev[t,:] = rev[t-1,:] + r usr[t,:] += u sha[t,:] = s #percentiles of the complete revenue row at time t percentiles_rev[t,:] = percentile(rev[t,:], perc_selection) percentiles_usr[t,:] = percentile(usr[t,:], perc_selection) percentiles_sha[t,:] = percentile(sha[t,:], perc_selection) Revenue simulation plots Having captured all our data, we can now plot it to see how the variable of interest, in this case the revenue of the new product, develops in the next 10 years.\nFirst we print the percentiles to get the numeric data:\n# Print the percentiles of revenue df = pd.DataFrame(percentiles_rev, columns=[\u0026#39;5%\u0026#39;,\u0026#39;25%\u0026#39;,\u0026#39;50%\u0026#39;,\u0026#39;75%\u0026#39;,\u0026#39;95%\u0026#39;]) df Now we can plot these percentiles of 
revenue in an aggregated form.\n# Plot the percentiles of revenue x = np.arange(0,10) df.plot(kind=\u0026#39;line\u0026#39;, color=\u0026#39;black\u0026#39;, linewidth=0.2) plt.fill_between(x,df[\u0026#39;25%\u0026#39;].values,df[\u0026#39;75%\u0026#39;].values, color=\u0026#39;grey\u0026#39;, alpha=0.6) plt.fill_between(x,df[\u0026#39;5%\u0026#39;].values,df[\u0026#39;95%\u0026#39;].values, color=\u0026#39;grey\u0026#39;, alpha=0.4) plt.title(\u0026#34;Revenue percentiles over %s years\u0026#34; %num_timesteps) plt.show() We can also plot the individual \u0026ldquo;random walks\u0026rdquo; of the simulation just for fun.\n# Plot the random walks for revenue df2=pd.DataFrame(rev) df2.plot(kind=\u0026#39;line\u0026#39;, legend=False, alpha=.03) plt.title(\u0026#34;Revenue random walks over %s years\u0026#34; %num_timesteps) plt.show() Market share simulation plots Similarly, let\u0026rsquo;s plot our simulation results for the market share calculations\n# Print the percentiles of market size df_usr = pd.DataFrame(percentiles_usr, columns=[\u0026#39;5%\u0026#39;,\u0026#39;25%\u0026#39;,\u0026#39;50%\u0026#39;,\u0026#39;75%\u0026#39;,\u0026#39;95%\u0026#39;]) #print(df) # Plot the percentiles market size x = np.arange(0,10) df_usr.plot(kind=\u0026#39;line\u0026#39;, color=\u0026#39;w\u0026#39;) plt.fill_between(x,df_usr[\u0026#39;25%\u0026#39;].values,df_usr[\u0026#39;75%\u0026#39;].values, color=\u0026#39;grey\u0026#39;, alpha=0.6) plt.fill_between(x,df_usr[\u0026#39;5%\u0026#39;].values,df_usr[\u0026#39;95%\u0026#39;].values, color=\u0026#39;grey\u0026#39;, alpha=0.4) plt.title(\u0026#34;Market size percentiles over %s years\u0026#34; %num_timesteps) plt.show() # Plot the random walks for market size df2=pd.DataFrame(usr) df2.plot(kind=\u0026#39;line\u0026#39;, legend=False, alpha=.03) plt.title(\u0026#34;Market size random walks over %s years\u0026#34; %num_timesteps) plt.show() Product revenue and market size distribution Finally, we can visualize how the revenue 
is distributed in our simulation for a particular year using histograms. For example, let\u0026rsquo;s plot the distribution of revenue:\nax1 = plt.subplot(111) ax1 plt.title(\u0026#34;Product revenue, price mode %s €\u0026#34; %price_mode) plt.hist(rev[0], bins=50, range=(0, r.max()), label=\u0026#39;year 1\u0026#39;) plt.hist(rev[2], bins=50, range=(0, r.max()), label=\u0026#39;year 3\u0026#39;) plt.hist(rev[4], bins=50, range=(0, r.max()), label=\u0026#39;year 5\u0026#39;)#axis([0,width,0,height]) plt.hist(rev[6], bins=50, range=(0, r.max()), label=\u0026#39;year 7\u0026#39;) plt.legend() plt.show() Of course, the farther into the future we model, the wider the distribution, as our model becomes more and more uncertain.\nWe can do the same with the market size distribution:\nax2 = plt.subplot(111) ax2 plt.title(\u0026#34;Market size, price mode %s €\u0026#34; %price_mode) #hist(c, bins=50, range=(0, c.max()), ) plt.hist(usr[0], bins=50, range=(0, u.max()), label=\u0026#39;year 1\u0026#39;) plt.hist(usr[2], bins=50, range=(0, u.max()), label=\u0026#39;year 3\u0026#39;) plt.hist(usr[4], bins=50, range=(0, u.max()), label=\u0026#39;year 5\u0026#39;) plt.hist(usr[6], bins=50, range=(0, u.max()), label=\u0026#39;year 7\u0026#39;) plt.show() Final remarks In this post, we saw how we can use Python to build a simple Monte Carlo simulation and how we can plot these results to look at forecasting from a different perspective.\n","permalink":"https://arcosdiaz.com/archive/2016-10-15-product-revenue-forecast/","summary":"\u003cp\u003eBeing able to see the future would be a great superpower (or so one would think). Luckily, it is already possible to \u003cem\u003emodel\u003c/em\u003e the future using Python to gain insights into a number of problems from many different areas. In marketing, being able to model how successful a new product will be would be of great use. 
In this post, I will take a look at how we can model the future revenue of a product by making certain assumptions and running a Monte Carlo simulation.\u003c/p\u003e","title":"Simulating the revenue of a product with Monte-Carlo random walks"},{"content":"Hi! I\u0026rsquo;m Dario, a Senior Data Scientist at BASF with a PhD in Molecular Neuroscience (University of Heidelberg / Max Planck Institute). I work at the intersection of machine learning and the life sciences.\nWhat I do: I build ML solutions for complex data, including graph-based simulations of production networks, knowledge graphs integrating multi-source data, and genomic foundation-model applications. Before BASF, I spent three years at IBM developing graph analytics, EHR-based predictive models, and anomaly-detection systems.\nWhere I\u0026rsquo;m headed: I\u0026rsquo;m expanding into computational biology, bioinformatics, and AI-driven scientific discovery. Recent projects include a brain region classifier trained on GTEx RNA-seq data (95% accuracy; top features map to known neuroscience markers), and a multi-agent system for drug-target discovery that autonomously queries Open Targets, UniProt, ChEMBL, and PubMed to identify and rank therapeutic targets for a given disease. I\u0026rsquo;m also completing coursework in sequence alignment, NGS, and variant prediction.\nBackground: I started university at 15 and finished my PhD at 25. My research at the Max Planck Institute focused on glutamate receptors and memory formation using AAV-based tools. I\u0026rsquo;ve contributed to publications in the EMBO Journal and at the AMIA symposium.\nI love languages and speak English, German, Spanish, French, and Italian. I also have HSK3/4 level in Mandarin Chinese.\nLinkedIn · GitHub\n","permalink":"https://arcosdiaz.com/about/","summary":"\u003cp\u003eHi! I\u0026rsquo;m Dario, a Senior Data Scientist at BASF with a PhD in Molecular Neuroscience (University of Heidelberg / Max Planck Institute). 
I work at the intersection of machine learning and the life sciences.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eWhat I do:\u003c/strong\u003e I build ML solutions for complex data, including graph-based simulations of production networks, knowledge graphs integrating multi-source data, and genomic foundation-model applications. Before BASF, I spent three years at IBM developing graph analytics, EHR-based predictive models, and anomaly-detection systems.\u003c/p\u003e","title":"About Me"}]