18:18, 18th December 2023
OWASP Top 10 for Large Language Model Applications
The OWASP GenAI Security Project is a global, open-source initiative that identifies and addresses security risks in generative AI technologies, including large language models and AI-driven applications, and provides guidance and tools for secure development and deployment. Originally launched as the OWASP Top 10 for Large Language Model Applications, the project has grown into a broader effort with over 600 contributors from more than 18 countries and nearly 8,000 community members, offering resources, educational materials and opportunities for collaboration through meetings and community engagement. The initiative remains non-commercial, relying on community support and sponsorships, and continues to update its Top 10 list as a core reference for critical vulnerabilities in LLM applications.
20:26, 31st July 2023
So you want to build your own open source ChatGPT-style chatbot…
Mozilla undertook a week-long hackathon to build an internal, open-source chatbot prototype that runs entirely on its own cloud infrastructure, using no third-party APIs or proprietary services. The team navigated multiple layers of technical decision-making, choosing the llama.cpp runtime over Hugging Face's tools because of time constraints and configuration difficulties, and ultimately selecting Meta's LLaMA 2 model after manually evaluating several options for resistance to bias, toxicity and misinformation.
Model selection was complicated by legal restrictions, as many popular models inherit non-commercial licensing terms from the original LLaMA weights, limiting viable options. The team also implemented a custom vector search and embedding solution using Python to give the chatbot access to a small amount of internal Mozilla knowledge, while carefully crafting a system prompt to align the chatbot's behaviour with Mozilla's values and policies around inclusion and factual accuracy. LangChain was used only minimally for the embedding layer, with most orchestration handled through custom Python code, and an existing internal interface was repurposed as the front end.
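The retrieval pattern described above — embed documents, embed the query, rank by similarity — can be sketched in a few lines of plain Python. This is not Mozilla's actual code; it uses a toy bag-of-words "embedding" and cosine similarity purely to illustrate the shape of a custom vector search, where a real system would use learned dense vectors from an embedding model.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use learned dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A miniature "internal knowledge" corpus (illustrative only).
documents = [
    "Mozilla builds the Firefox web browser",
    "LLaMA 2 is an open large language model",
    "Vector search retrieves the most similar documents",
]
index = [(doc, embed(doc)) for doc in documents]

def search(query: str, k: int = 1) -> list[str]:
    # Rank stored documents by similarity to the query embedding.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(search("open large language model"))
```

In a production pipeline the retrieved passages would then be injected into the system prompt so the model can answer from them, which is the role LangChain's embedding layer played in Mozilla's build.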
The project concluded that open-source chatbot development remains complex and inconsistent, that evaluating models on trustworthiness criteria beyond raw performance is still too difficult, and that prompt engineering remains critically important, with Mozilla signalling its intention to address these challenges by contributing further to the open-source AI community.
16:42, 31st July 2023
Did ChatGPT write this? Here’s how to tell.
The rise of AI chatbots like ChatGPT has introduced new possibilities for generating information, yet also raised concerns about accuracy and the spread of misinformation. While these tools can produce detailed responses on a wide range of topics, they occasionally provide incorrect or misleading information, complicating efforts to distinguish between human and AI-generated content.
Detection methods, such as tools developed by OpenAI and academic institutions, are available but imperfect, often failing to identify shorter texts or content that has been edited by humans. Experts caution that no detection system is foolproof, as AI models continue to evolve, making it increasingly difficult to reliably identify AI-generated text.
Additionally, verifying the accuracy of ChatGPT's responses requires critical evaluation, including checking for inconsistencies, errors, or contextually inappropriate information. Organisations like Mozilla are addressing these challenges by promoting responsible AI development and emphasising the need for education on the societal and ethical implications of emerging technologies.
19:52, 4th July 2023
Voiceflow is a platform designed to enable the creation and deployment of AI-driven customer experience solutions across multiple channels, offering tools for building, testing and scaling conversational agents. It supports omnichannel integration, allowing businesses to embed AI agents into websites, mobile applications and call centres, with features such as real-time collaboration, customisable workflows and compatibility with major enterprise systems.
The platform is used by organisations to automate customer support, generate leads and improve user engagement, with case studies highlighting rapid implementation and measurable outcomes such as increased automation rates and user satisfaction. Users and industry professionals describe it as a versatile tool that simplifies the development process, facilitates teamwork and provides flexibility through visual design interfaces, API integrations and code editing capabilities, while adhering to enterprise security standards including SOC 2, ISO 27001 and GDPR compliance.
17:13, 14th June 2023
ChatGPT brings AI into popular culture
The rise of ChatGPT has sparked widespread discussion about the potential and challenges of generative AI, with users divided between enthusiasm for its ability to assist with tasks like coding and writing and concern over its reliability, accuracy and potential to undermine professional standards. As a large language model trained on diverse data, it produces responses that often appear plausible but lack transparency in sourcing or explanation, raising issues of intellectual property and trust. While it demonstrates capability in areas such as SAS programming, its outputs can contain errors or overly complex solutions, highlighting the need for human oversight. The technology relies on user feedback to refine its outputs, and its continued evolution may reshape how AI is integrated into professional and educational contexts; its current limitations, however, underscore the importance of critical evaluation by users.
17:02, 14th June 2023
My general advice on getting an analytics job
For those looking to break into or transition to a career in analytics, a few practical approaches are worth considering. Listening to dedicated podcasts on the subject can provide collective insight from a wide range of professionals far more comprehensively than any single piece of advice could. Actively searching job listings and setting up alerts on platforms such as LinkedIn allows aspiring analysts to identify patterns in job titles, locations, industries and required skills, effectively building a picture of their ideal role. Sharing projects, reviews and tutorials online, particularly on LinkedIn, can demonstrate value even without formal professional experience in the field, and the habit becomes easier to maintain over time. Finally, working with a coach who can offer personalised guidance may prove more beneficial than generic training programmes, given that individual circumstances vary considerably.
16:27, 30th November 2022
What is Chebyshev’s Theorem, and How Does it Apply to Data Science?
Chebyshev’s Theorem provides a method to estimate the proportion of data within a certain number of standard deviations from the mean in any dataset, regardless of its distribution. Unlike the Empirical Rule, which applies specifically to normal distributions and states that approximately 68%, 95% and 99.7% of data fall within one, two and three standard deviations respectively, Chebyshev’s Theorem offers a more general guarantee: at least 1 − 1/k² of the data lies within k standard deviations of the mean, for any k greater than one. That gives at least 75% within two standard deviations and at least 88.9% (8/9) within three. This theorem is particularly valuable in data science when dealing with non-normal distributions, as it allows analysts to infer data dispersion and make probabilistic statements about observations even when the underlying distribution is unknown or skewed, complementing statistical measures like the mean and standard deviation.
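The bound above can be checked directly. A minimal sketch in plain Python computes 1 − 1/k² and then verifies empirically, on a deliberately skewed (exponential) sample, that the observed proportion within two standard deviations really does exceed the guaranteed 75%:

```python
import random
import statistics

def chebyshev_lower_bound(k: float) -> float:
    # Minimum proportion of any dataset within k standard deviations of the mean (k > 1).
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k**2

print(chebyshev_lower_bound(2))  # 0.75
print(chebyshev_lower_bound(3))  # 0.888... (8/9)

# Empirical check on a skewed, non-normal sample:
random.seed(0)
data = [random.expovariate(1.0) for _ in range(10_000)]
mu, sigma = statistics.mean(data), statistics.pstdev(data)
within_2sd = sum(abs(x - mu) <= 2 * sigma for x in data) / len(data)
print(within_2sd >= chebyshev_lower_bound(2))  # the guarantee holds for this sample
```

Note that Chebyshev's bound is loose by design: for this exponential sample the observed proportion within two standard deviations is far above 75%, but 75% is the most that can be guaranteed without knowing the distribution.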
18:22, 19th November 2022
The Complete Free PyTorch Course for Deep Learning
A comprehensive free course on PyTorch for deep learning and machine learning is available, offering 25 hours of video content alongside an online book and accompanying resources. Designed by Daniel Bourke, the course covers foundational topics such as PyTorch fundamentals, workflow processes, neural network classification, computer vision techniques and creating custom datasets, providing practical coding examples throughout. Hosted by KDnuggets, the course is aimed at learners seeking to develop proficiency in using PyTorch for deep learning applications, with materials accessible through the platform's resources.
14:50, 19th November 2022
7 Techniques to Handle Imbalanced Data
Handling imbalanced datasets, common in fields such as fraud detection and intrusion detection, requires careful consideration of evaluation metrics beyond accuracy, as models may otherwise fail to identify rare events effectively. Techniques include resampling methods like under-sampling to reduce the majority class or over-sampling to generate synthetic minority instances, though both approaches have limitations depending on data availability.
Cross-validation should be set up before resampling, with resampling applied only to the training folds, so that duplicated or synthetic minority instances do not leak into the validation data and inflate performance estimates. Ensemble methods that combine multiple resampled datasets can improve generalisation. Adjusting the ratio of classes during resampling, clustering the majority class to retain representative samples and designing models with cost functions that penalise misclassifying the minority class more heavily are further strategies. These approaches, along with algorithms such as XGBoost that can manage class imbalance through built-in weighting, offer practical solutions to enhance model performance in scenarios where rare events are critical to detect.
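The simplest of the resampling techniques above, random over-sampling, can be sketched in plain Python. This is a deliberately minimal stand-in for library implementations (such as SMOTE, which synthesises new minority points rather than duplicating existing ones): it just duplicates randomly chosen minority-class rows until every class matches the majority count.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    # Duplicate minority-class rows at random until every class
    # matches the majority-class count. Apply only to training data.
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

# Toy imbalanced data: 8 "normal" rows vs 2 "fraud" rows.
X = [[i] for i in range(10)]
y = ["normal"] * 8 + ["fraud"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 8 rows
```

Consistent with the cross-validation caveat above, this would be applied inside each training fold, never to the held-out data.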
22:46, 18th November 2022
15 More Free Machine Learning and Deep Learning Books
Another compilation of 15 freely available eBooks covering machine learning and deep learning has been put together for those looking to build or deepen their knowledge in these areas. The selection spans a wide range of topics and skill levels, from foundational introductions to neural network architectures and the mathematics underpinning machine learning, through to more advanced subjects such as deep learning applied to physical simulations, graph-based data representation and the analysis of predictive models.
Authors include both academic researchers and industry practitioners, with titles from figures such as Ian Goodfellow, Yoshua Bengio and Aaron Courville, as well as Microsoft researchers Li Deng and Dong Yu. Some books take a practical, code-focused approach using tools like Jupyter notebooks and the fastai library, while others lean more heavily into theory, covering topics such as backpropagation, regularisation, natural language processing and reinforcement learning. One title is aimed specifically at those preparing for deep learning job interviews, and another offers an exceptionally thorough grounding in the mathematics relevant to computer science and machine learning.