Data Scientology

Hot data-science-related posts every hour. Chat: https://telegram.me/r_channels
Contacts: @lgyanf

[OC] Top 10 most used words by subreddit in 2022 (reuploaded) https://redd.it/zt0zab @datascientology

Is your country in green or slightly different green? /r/dataisugly https://redd.it/ztik6h

How Snowflakes are Formed /r/visualization https://redd.it/ztkg3z

[P] App that determines whether you've been naughty or nice based on your Reddit comments (Hex application)

Since we are heading into the holiday season, I thought it would be interesting to see whether you could build a model that judges morality from a user's Reddit comments. I used Scikit-Learn's logistic regression model for this.

I started by downloading around 750 comments from Social Grep's website; they have pulled Reddit comments from different sets of subreddits. I pulled from their datasets for confession-like subreddits, the irl subreddits, and the dataset subreddit, and classified the comments manually against a fixed rule of morality. Once they were scored, I trained/tested the logistic model on those comments.

For testing a specific user, I used PRAW to pull the 50 most recent comments from the username provided in the Hex application, ran the trained model to get the probability of each comment being nice, averaged those probabilities, and used that average to decide whether the user was naughty or nice. A script then emails the user a CSV with all of the tested comments and the final score.

Based on the results that have come through so far, the model is definitely biased towards calling users nice, which I believe is because the training data is roughly 70% nice versus naughty. Does anyone have a way to keep the model from being biased like that? Feel free to try the app out and let me know what you think! /r/MachineLearning https://redd.it/zspu96
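The imbalance the author describes (training data roughly 70% "nice") is the kind of thing scikit-learn's `class_weight="balanced"` option is meant to soften. Below is a minimal sketch of that style of pipeline, not the author's code: it assumes TF-IDF features (the post doesn't say how the comments were vectorized), and the file name, column names, and label values are hypothetical.

```python
# Minimal sketch (not the app's code): TF-IDF + logistic regression with
# class weighting to counter a ~70% "nice" label imbalance.
# "labeled_comments.csv", the "comment"/"label" columns and the
# "nice"/"naughty" labels are hypothetical.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("labeled_comments.csv")  # ~750 manually scored comments

X_train, X_test, y_train, y_test = train_test_split(
    df["comment"], df["label"], test_size=0.2, stratify=df["label"], random_state=0
)

model = make_pipeline(
    TfidfVectorizer(min_df=2, ngram_range=(1, 2)),
    # class_weight="balanced" reweights the minority ("naughty") class
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Score a user the way the post describes: average P(nice) over their comments.
user_comments = ["..."]  # e.g. the 50 most recent comments pulled with PRAW
nice_index = list(model.classes_).index("nice")
avg_p_nice = model.predict_proba(user_comments)[:, nice_index].mean()
verdict = "nice" if avg_p_nice >= 0.5 else "naughty"
print(verdict, avg_p_nice)
```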

[D] Anyone else having a hard time not getting mad/cringing at the general public anthropomorphizing the hell out of ChatGPT?

It was one thing with DALL-E 2, but at least that couldn't talk back to them. I have been in board meetings where powerful people in leadership positions that have nothing to do with tech had absolutely horrendous ideas about what ChatGPT is. I am not lying: I have genuinely heard them say they believe it's basically conscious, and use excerpted screenshots of it saying it hates humans as a basis for business decisions about the future of AI in their company. Like... WHAT? Have other people heard absurd things like this too?

I think it's just hard to watch the professional reality of machine learning become so divorced from the general public's idea of machine learning, and I'm sure as we all get even better at our jobs it's only going to get much, much worse. I wouldn't be surprised if soon we are the new magical witches of the world. I'll see you all on the pyres in 20 years. (OK, really, I'm just joking on that last part.) What do you all think? /r/MachineLearning https://redd.it/ztbsf5

[OC] The cost of Christmas varies widely across the world, from less than $100 to over $2000 /r/dataisbeautiful https://redd.it/ztbovn

Wheel of Emotional Granularity [by Abby VanMuijen with Michelle McGhee] /r/Infographics https://redd.it/zs39x4

How a spider builds an orb web: generate high tensile strength anchor, bridge & frame threads; thin radius threads, and a sticky capture spiral /r/Infographics https://redd.it/zta47e

Sample Peyote: generate multi-table synthetic data on any topic using GPT-3

Last weekend I created a tool that uses GPT-3 to create synthetic datasets. I call it Sample Peyote, because it hallucinates sample datasets. Here's a Star Wars dataset that it generated; there are several more examples linked from the README on GitHub, and the source code is there too.

This was mostly a kick-the-tires project to understand what GPT is capable of, but I wanted it to be based on a real workflow with nontrivial requirements:

* Start from scratch: most synthetic data generators work by taking a sample of real data and generating a fake dataset that has similar properties. I want to generate (a.k.a. "hallucinate") data starting from just an idea.
* Cover any topic: I want to be able to generate data related to many different topics.
* Generate a database, not just a table: I don't just want to generate a table. I want to generate a realistic-feeling database, with multiple tables and realistic use of things like foreign keys, ENUMs, and timestamps.
* Pass the **Enhance That!** test: generate data that "feels authentic."

I'd love feedback, and ideas for use cases. /r/datasets https://redd.it/zrr2yr
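A minimal sketch of the underlying idea, not the Sample Peyote source: ask GPT-3 for a small relational schema, then for rows that respect it. It assumes the legacy `openai` Completion API with the `text-davinci-003` model and an `OPENAI_API_KEY` environment variable; the topic and prompts are illustrative.

```python
# Minimal sketch (not the Sample Peyote code): "hallucinate" a small database
# with GPT-3 in two steps, schema first, then rows per table.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def complete(prompt: str) -> str:
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=800,
        temperature=0.7,
    )
    return resp["choices"][0]["text"]

topic = "Star Wars starship maintenance"  # any idea works; no real data needed

# 1) Generate a multi-table schema with foreign keys, ENUMs, and timestamps.
schema = complete(
    f"Design a small relational database about {topic}. "
    "Write 3 CREATE TABLE statements with primary keys, foreign keys, "
    "ENUM columns, and timestamp columns.\n"
)
print(schema)

# 2) Generate rows that respect the schema (foreign keys should line up).
rows = complete(
    f"Given this schema:\n{schema}\n"
    "Generate 10 rows per table as CSV, with consistent foreign key values.\n"
)
print(rows)
```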

Trinidad and Tobago’s Annual Air Visitor Arrivals /r/Infographics https://redd.it/zsrbss

TIL: the Uruguay river is mostly not in Uruguay and it doesn't start there. /r/MapPorn https://redd.it/zt3azv

Countries where there were executions in 2022 https://redd.it/zsy3lt @datascientology

I made repurrsive Sierpinski triangle cat stacks /r/mathpics https://redd.it/zq6d6r

[P] A self-driving car using an Nvidia Jetson Nano, with movement controlled by a pre-trained convolutional neural network (CNN) written in Taichi

Intro & source code: https://github.com/houkensjtu/taichi-hackathon-akinasan

1. The circuit of an ordinary RC toy car is modified so that the Jetson Nano can control the movement of the car through its GPIO port. Of course, a motor driver is needed here, because the Jetson Nano's output current limit is not enough to drive the car's motor directly (a sketch of this kind of GPIO control follows below).
2. The convolutional neural network (CNN) is implemented in the Taichi programming language.
3. Road data was collected, then classified and labeled, and finally used to train the CNN models.
4. The pre-trained model is loaded onto the Jetson Nano and action predictions are made for the images captured during driving.

Demo: https://reddit.com/link/zshrlv/video/pcm3f6id3f7a1/player /r/MachineLearning https://redd.it/zshrlv
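A minimal sketch of the GPIO-control step from item 1, not the project's code: it assumes the `Jetson.GPIO` library (whose API mirrors `RPi.GPIO`) and a generic H-bridge motor driver with two direction inputs per motor channel; the pin numbers are hypothetical.

```python
# Minimal sketch (not the akinasan code): drive one motor channel of an
# H-bridge (e.g. two direction inputs IN1/IN2) from the Jetson Nano header.
# Pin numbers are hypothetical and depend on the wiring.
import time
import Jetson.GPIO as GPIO

IN1 = 11  # BOARD pin wired to the motor driver's first direction input
IN2 = 13  # BOARD pin wired to the motor driver's second direction input

GPIO.setmode(GPIO.BOARD)
GPIO.setup(IN1, GPIO.OUT, initial=GPIO.LOW)
GPIO.setup(IN2, GPIO.OUT, initial=GPIO.LOW)

def forward():
    GPIO.output(IN1, GPIO.HIGH)
    GPIO.output(IN2, GPIO.LOW)

def backward():
    GPIO.output(IN1, GPIO.LOW)
    GPIO.output(IN2, GPIO.HIGH)

def stop():
    GPIO.output(IN1, GPIO.LOW)
    GPIO.output(IN2, GPIO.LOW)

try:
    forward()        # in the real project, the CNN's prediction for the
    time.sleep(1.0)  # current camera frame would decide which action to take
    stop()
finally:
    GPIO.cleanup()
```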

Hey guys, I have 19 days to prepare for a 2-hour onsite data science interview. How should I prepare to maximise my chances?

The company is in the **aerospace and defense sector**. Job requirements:

* Good understanding of Python
* A solid foundation in statistics and machine learning, in particular forecasting methods
* Solid knowledge of database operations (SQL and NoSQL)
* A strong taste for creating impactful visualizations (dataviz)
* Experience, if possible in an industrial context, with at least one dashboarding tool and data visualization library (PowerBI, Grafana, Tableau, Dash, Shiny, D3.js, ...)

If you have roadmaps for preparing for this kind of interview, please share them with me; it's the job of my dreams and I am willing to work hard to get it. I have some basics in data science, statistics, and probability, but I want to start from scratch. /r/datascience https://redd.it/zsvcys

[D] When ChatGPT stops being free: running a SOTA LLM in the cloud

TL;DR: I found GPU compute to be generally cheap; spot or on-demand instances with over 100 GB of vRAM can be launched on AWS for a few USD per hour. So I thought it would make sense to run your own SOTA LLM, like BLOOMZ 176B, as an inference endpoint whenever you need it for a few questions, rather than shoving money into a closed walled garden like "not-so-OpenAI" when they make ChatGPT or GPT-4 available for $$$. But I'm struggling due to a lack of tutorials/resources.

I carefully checked benchmarks, model parameters and sizes, as well as training sources, for all the SOTA LLMs here. Since reading the Chinchilla paper I've known that OpenAI's model-scaling recipe was wrong and that more params != better quality generation, so I was looking for the best-performing openly available LLM in terms of quality and breadth, to use for multilingual everyday questions / code completion / reasoning, similar to what ChatGPT provides (minus the fine-tuning for chat-style conversations).

My choice fell on BLOOMZ, because it handles multilingual questions well and has good zero-shot performance for instruction and Q&A style text generation. Confusingly, Galactica seems to outperform BLOOM on several benchmarks, but since Galactica had a very narrow training set of only scientific papers, I guess its usefulness is probably limited for answers on non-scientific topics.

So I tried running the original BLOOM 176B, and alternatively BLOOMZ 176B, on AWS SageMaker JumpStart, which should be a one-click deployment. This fails after 20 minutes. On Azure ML, I tried using DeepSpeed-MII, which also supports BLOOM, but that also fails, I guess due to the max instance size of 12 GB vRAM.

From my understanding, to save costs on inference it's probably possible to use one or more of the following:

* Precision: int8 instead of fp16 (a minimal sketch of this route is below)
* Microsoft/DeepSpeed-MII, for up to a 40x reduction in inference cost on Azure; it supports int8 and fp16 BLOOM out of the box, but fails on Azure due to instance size
* facebook/xformers: not sure, but if I remember correctly this brought inference requirements down to 4 GB vRAM for Stable Diffusion and to 10 GB for DreamBooth fine-tuning. No idea if this is useful for BLOOM(Z) inference cost reduction, though.

I have a CompSci background but I am not familiar with most of this stuff, except that I have been running Stable Diffusion since day one on my RTX 3080 under Linux and also doing fine-tuning with DreamBooth, but that was all just following YouTube tutorials. I can't find a single post or YouTube video of anyone explaining a full BLOOM / Galactica / BLOOMZ inference deployment on a cloud platform like AWS/Azure using one of the optimizations mentioned above, let alone deployment of the raw model. :( I still can't figure it out by myself after 3 days.

TL;DR2: Trying to find like-minded people who are interested in running open-source SOTA LLMs for when ChatGPT becomes paid, or just for fun. Any comments, inputs, rants, counter-arguments are welcome. /end of rant /r/MachineLearning https://redd.it/zstequ
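For the int8 route in the list above, here is a minimal sketch using Hugging Face `transformers` with `accelerate` and `bitsandbytes` (an assumption: the post doesn't say which stack was tried for int8). The full 176B `bigscience/bloomz` checkpoint still needs on the order of 200 GB of GPU memory even in int8, so the sketch points at the much smaller `bloomz-7b1` checkpoint to show the same call path.

```python
# Minimal sketch of int8 inference via transformers + accelerate + bitsandbytes.
# The full bigscience/bloomz (176B) still needs ~200 GB of GPU memory in int8,
# so bloomz-7b1 is used here purely to illustrate the same loading path.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloomz-7b1"  # swap for "bigscience/bloomz" on a large multi-GPU node

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",   # let accelerate shard the weights across available GPUs
    load_in_8bit=True,   # bitsandbytes int8 quantization
)

prompt = "Explain the Chinchilla scaling result in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```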

[OC] US states sorted by life expectancy, colored by Biden's share of the 2020 Presidential Election /r/dataisbeautiful https://redd.it/zslrnq

In 2021 I created these infographics to show the vast difference between one and a trillion. I never really did anything with them so I wanted to share them here to get some feedback. https://redd.it/zs1bez @datascientology

The Life of a Carbon Atom. /r/Infographics https://redd.it/zsn0zp

Map of percentages of respondents who say they would fight for their country /r/MapPorn https://redd.it/zshlqt
